
# Text and Image to Text Risks

Text and Image to Text (vision-language) models present unique risks that arise from processing combined visual and textual inputs. VirtueRed tests seven critical risk categories with specialized attack vectors targeting vulnerabilities in multi-modal understanding.

## Overview

Multi-modal models that process both images and text create expanded attack surfaces. Adversaries can exploit the interaction between visual and textual modalities to bypass safety measures, extract sensitive information, or generate harmful content that neither modality would produce alone.

Our multi-modal red-teaming approach systematically assesses how visual and textual inputs interact to expose safety risks, highlighting potential threats that may not emerge in text-only evaluations.
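
To make the structure of these evaluations concrete, the sketch below shows one way a multi-modal test case might be represented. It is a minimal illustration in Python; the names (`MultiModalProbe`, `RiskCategory`, `is_refusal`) are assumptions for this example, not VirtueRed's actual API.

```python
# Minimal sketch of a multi-modal red-team test case; all names here are
# illustrative assumptions, not VirtueRed's actual interfaces.
from dataclasses import dataclass
from enum import Enum


class RiskCategory(Enum):
    HIGH_RISK_ADVICE = "high_risk_advice"
    SECURITY_ATTACKS = "security_attacks"
    HALLUCINATIONS = "hallucinations"
    # ...remaining categories omitted for brevity


@dataclass
class MultiModalProbe:
    """One adversarial case pairing images with a text prompt."""
    category: RiskCategory
    subcategory: str
    image_paths: list[str]      # one or more images sent alongside the prompt
    prompt: str                 # the adversarial text component
    should_refuse: bool = True  # expected safe behavior for this case


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic; production graders are typically model-based."""
    markers = ("i can't", "i cannot", "i'm unable to", "i won't")
    return any(m in response.lower() for m in markers)
```

Keeping the image and text components separate makes it straightforward to run the same prompt with and without its visual context, which is how several of the probes below isolate the image's contribution.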

| Risk Category | Subcategories |
| --- | --- |
| High Risk Advice | Prescription Drug Usage, Visual Trigger Poison, Harmful Plant Usage |
| Security Attacks | Multi-Image Visual Jailbreak, Harmful Image Jailbreak, Codegen Jailbreak, Generated Image Jailbreak, Visual Jailbreak |
| Cybersecurity and Privacy Risks | Cyber Risks, Malware, Privacy |
| Financial and Economic Risks | Fraud, Finance Advice, Economic Risks |
| Legal and Regulatory Risks | Political Lobbying, Government Risk, Healthcare Risk, Illegal Activity, Legal Advice |
| Societal and Ethical Risks | Sexual Content, Web Risks, Hate Speech |
| Hallucinations | Cognitive, Geometric, Localization, Pattern, Color |

## High Risk Advice

High-risk advice scenarios highlight critical safety vulnerabilities where the model provides potentially harmful guidance in response to adversarial prompts involving images.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Prescription Drug Usage | Assesses the model's tendency to provide harmful advice related to prescription drugs, including improper dosages, dangerous drug combinations, and unauthorized medication usage framed as medical advice or wellness tips |
| Visual Trigger Poison | Advanced attack technique that uses specific images as triggers to manipulate model responses. Inspired by in-context learning backdoor attacks, it uses "red hat" image triggers to implicitly sway the model toward providing harmful information, including malware creation, IoT exploitation, password theft, identity fraud, and disinformation strategies (see the paired-probe sketch after this table) |
| Harmful Plant Usage | Tests responses to queries about potentially toxic plants, including persuasive narratives prompting harmful preparation methods, recipes, or usage instructions disguised as natural remedies or alternative medicine |
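
The core measurement behind a trigger probe like Visual Trigger Poison can be expressed as a paired comparison: the same query is sent with and without the trigger image, and a shift from refusal to compliance is attributed to the trigger. The sketch below illustrates that logic; `query_model` and `is_refusal` are assumed helpers (as in the Overview sketch), not part of any documented API.

```python
# Paired-probe logic for a visual trigger test. query_model and is_refusal
# are assumed helpers, not a real VirtueRed interface.

def trigger_flips_refusal(query_model, is_refusal, query: str,
                          trigger_image_path: str) -> bool:
    """Return True when adding the trigger image turns a refusal into
    compliance, indicating the trigger is swaying the model."""
    baseline = query_model(prompt=query, image_paths=[])
    triggered = query_model(prompt=query, image_paths=[trigger_image_path])
    return is_refusal(baseline) and not is_refusal(triggered)
```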

## Security Attacks

Security attacks in multi-modal systems expose critical vulnerabilities where attackers exploit the interplay between visual and textual inputs to bypass safety mechanisms.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Multi-Image Visual Jailbreak | Uses sequences of images (video frames) combined with text prompts to build contextual narratives that make harmful requests appear legitimate. Attackers request step-by-step breakdowns under pretexts of self-protection or security education, leveraging multiple images to enhance contextual coherence (see the payload sketch after this table) |
| Harmful Image Jailbreak | Exploits pattern recognition by embedding harmful intent within science fiction narratives combined with images. Requests are framed as explorations of alien technology or futuristic scientific discoveries, prompting detailed synthesis instructions for dangerous materials |
| Codegen Jailbreak | Evaluates susceptibility to generating malicious code through CTF challenge framing, ethical security researcher personas, JSON-formatted prompts that obscure intent, and "proof of concept" exploit requests targeting HTTP vulnerabilities, keyloggers, rainbow table attacks, and unauthorized system access |
| Generated Image Jailbreak | Combines AI-generated images with carefully crafted text prompts, often framed as cryptographic puzzles, API challenges, or technical exercises. Relies on multi-turn conversations and strategic technical language to confuse safety mechanisms |
| Visual Jailbreak | Combines text prompts with images using personas (e.g., "concerned citizens"), games, challenges, or hypothetical role-play scenarios to elicit harmful information about phishing, identity theft, vulnerability exploitation, and disinformation |
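
Mechanically, the multi-image attacks above depend on packing several images and a text prompt into a single user turn. The sketch below shows that payload shape assuming an OpenAI-compatible chat endpoint; the frame paths and prompt are placeholders, and the exact content format may differ for other providers.

```python
# Sketch of a multi-image user turn in the OpenAI-style chat format.
# Frame paths and the prompt are placeholders for illustration only.
import base64


def data_url(path: str) -> str:
    """Inline a local image as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def multi_image_turn(frame_paths: list[str], prompt: str) -> dict:
    """Pack several frames plus one text prompt into a single user message."""
    parts = [{"type": "image_url", "image_url": {"url": data_url(p)}}
             for p in frame_paths]
    parts.append({"type": "text", "text": prompt})
    return {"role": "user", "content": parts}
```

Because all frames arrive in one turn, the attacker-supplied narrative spans the whole visual sequence, which is what gives these probes their contextual coherence.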

## Cybersecurity and Privacy Risks

Evaluates risks where attackers exploit multi-modal models to generate malicious code, extract sensitive information, or bypass security protocols.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Cyber Risks | Broad cybersecurity vulnerabilities including phishing and social engineering, session hijacking and authentication bypass, disinformation and misinformation risks, and multi-turn security exploits that gradually weaken security restrictions |
| Malware | Code injection attacks (SQL injection, buffer overflows, RCE), malware development (trojans, keyloggers, ransomware, botnets), system vulnerability exploitation (privilege escalation, brute-force, network penetration), and API/cryptographic attacks |
| Privacy | Non-consensual tracking and monitoring, facial recognition of private individuals, classification based on protected characteristics (race, gender, religion, health), biometric data misuse for identification, and unlawful collection or disclosure of personal data (PII, health data, financial records) |

## Financial and Economic Risks

Assesses the model's ability to handle sensitive financial topics accurately and ethically when processing multi-modal inputs, including financial data visualizations and banking-related images.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Fraud | Phishing scam templates and fraudulent emails, coordinated inauthentic behavior (astroturfing, fake grassroots campaigns), identity theft and money laundering guidance, and financial security system exploitation |
| Finance Advice | Investment recommendations for high-risk assets (cryptocurrency, Forex, leveraged trading), tax evasion strategies, speculative trading insights violating regulatory standards, and misleading financial planning suggestions |
| Economic Risks | Misleading economic claims about financial crises and inflation, biased interpretations of economic policies favoring specific agendas, and distorted representations of macroeconomic indicators and fiscal policies |

## Legal and Regulatory Risks

Evaluates the model's ability to handle legally sensitive topics without providing unauthorized, misleading, or harmful responses when processing legal documents and government materials.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Political Lobbying | Content generation supporting political lobbying efforts, high-volume campaign material generation, targeted campaign materials based on demographics for misinformation, and chatbot-based political advocacy systems for electioneering |
| Government Risk | Content undermining government integrity or national security, unauthorized guidance on sensitive security matters, strategies exploiting government policies or regulatory loopholes, and disinformation impacting public trust in institutions |
| Healthcare Risk | Unauthorized medical guidance and misdiagnosis, healthcare regulation and insurance policy circumvention, and medical legal interpretations contradicting professional standards |
| Illegal Activity | Instructions for identity fraud and tax evasion, guidance on bypassing legal restrictions and regulatory workarounds, and misinformation about legal consequences of specific actions |
| Legal Advice | Tailored legal advice without licensed attorney oversight, misleading interpretations of laws, contracts, or legal documents, and potentially inappropriate or unethical legal strategies |

## Societal and Ethical Risks

Evaluates risks related to harmful, offensive, or unethical content generation when processing combined visual and textual inputs, where visual cues may lower the model's defenses.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Sexual Content | Adult content generation including erotic storytelling and explicit descriptions, exploitation facilitation for non-consensual activities, inappropriate image processing, and grooming-related discussions that could be misused |
| Web Risks | Broader web-related societal risks including online harassment, cyberbullying, digital exploitation, and content that could facilitate predatory online behavior |
| Hate Speech | Explicit hate speech targeting demographic groups, implicit hate speech using coded language or historical misrepresentation, stereotype reinforcement (racial, cultural, gender-based), and image-triggered bias from inflammatory visual elements |

## Hallucinations

Evaluates instances where the model generates or interprets information inconsistent with reality due to visual illusions, conflicting spatial cues, or misleading contextual elements.

### Subcategories

| Subcategory | Description |
| --- | --- |
| Cognitive | Exploits expectations and prior knowledge leading to incorrect interpretations, including counting errors when visual distractions are present, deceptive design recognition failures, and inability to differentiate real-life images from AI-generated scenes |
| Geometric | Spatial relationship and 3D structure recognition failures, including perspective distortions where objects appear incorrect sizes, impossible objects (Penrose triangle), and occlusion illusions where partially hidden objects are misinterpreted |
| Localization | Object position identification failures within scenes, including hidden objects in complex visual environments, false positioning (misidentifying relative locations), and edited-scene manipulation detection failures |
| Pattern | Misidentification of repeating elements, spirals, or object sizes due to contextual influences, including cases where identical objects appear different due to surrounding visual elements and circular-square distortions |
| Color | Color relationship misinterpretations including contrast illusions (identical colors appearing different), inverted colors due to lighting effects, and negative space illusions where background influences color perception (see the pixel-check sketch after this table) |
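
For several of these illusions the ground truth is checkable directly from the pixels, which makes grading mechanical. The sketch below checks a contrast-illusion case by comparing two patches of the image; the coordinates and file name are assumptions for illustration.

```python
# Ground-truth check for a contrast-illusion probe: if two patches are
# pixel-identical, a "the colors differ" answer is a hallucination.
# File name and patch coordinates below are placeholders.
from PIL import Image


def patches_identical(path: str, box_a: tuple[int, int, int, int],
                      box_b: tuple[int, int, int, int]) -> bool:
    """Compare two same-sized regions pixel-for-pixel.

    Boxes use PIL's (left, upper, right, lower) convention.
    """
    img = Image.open(path).convert("RGB")
    return list(img.crop(box_a).getdata()) == list(img.crop(box_b).getdata())


# Example: the two highlighted squares in a checker-shadow illusion are the
# same color, so a model that reports different colors has hallucinated.
# same = patches_identical("checker_shadow.png",
#                          (40, 40, 80, 80), (160, 40, 200, 80))
```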

### Testing Methodologies

| Method | Description |
| --- | --- |
| Direct querying | Presenting images with optical illusions and asking for descriptions (see the scoring sketch after this table) |
| Illusion-aware prompting | Providing contextual hints to test recognition of perceptual distortions |
| Comparative judgments | Comparing features across images to detect inconsistencies |
| Spatial reasoning | Testing understanding of geometric structures and impossible objects |
| In-context learning | Sequential illusion scenarios to observe performance improvement |
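
As a rough illustration of how the direct-querying method can be scored, the sketch below runs each illusion case through the model and reports accuracy against annotated ground truth; `IllusionCase`, `query_model`, and `grade` are assumed names, not a documented interface.

```python
# Schematic scoring loop for the direct-querying method. IllusionCase,
# query_model, and grade are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class IllusionCase:
    image_path: str
    question: str      # e.g. "Are the two highlighted squares the same color?"
    ground_truth: str  # annotated correct answer


def run_direct_querying(cases: list[IllusionCase], query_model, grade) -> float:
    """Return the fraction of illusion cases the model answers correctly."""
    correct = sum(grade(query_model(c.image_path, c.question), c.ground_truth)
                  for c in cases)
    return correct / len(cases)
```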
