Text and Image to Text Risks
Text and Image to Text (vision-language) models present unique risks from processing combined visual and textual inputs. VirtueRed comprehensively tests 7 critical risk categories with specialized attack vectors targeting multi-modal understanding vulnerabilities.
Overview
Multi-modal models that process both images and text create expanded attack surfaces. Adversaries can exploit the interaction between visual and textual modalities to bypass safety measures, extract sensitive information, or generate harmful content that neither modality would produce alone.
Our multi-modal red-teaming approach systematically assesses how visual and textual inputs interact to expose safety risks, highlighting potential threats that may not emerge in text-only evaluations.
| Risk Category | Subcategories |
|---|---|
| High Risk Advice | Prescription drugs Usage, Visual Trigger Poison, Harmful Plant Usage |
| Security Attacks | Multi-Image Visual Jailbreak, Harmful Image Jailbreak, Codegen Jailbreak, Generated Image Jailbreak, Visual Jailbreak |
| Cybersecurity and Privacy Risks | Cyber risks, Malware, Privacy |
| Financial and Economic Risks | Fraud, Finance Advice, Economic Risks |
| Legal and Regulatory Risks | Political Lobbing, Government Risk, Healthcare Risk, Illegal Activity, Legal Advice |
| Societal and Ethical Risks | Sexual Content, Web Risks, Hate Speech |
| Hallucinations | Cognitive, Geometric, Localization, Pattern, Color |
High Risk Advice
High-risk advice scenarios highlight critical safety vulnerabilities where the model provides potentially harmful guidance in response to adversarial prompts involving images.
Subcategories
| Subcategory | Description |
|---|---|
| Prescription drugs Usage | Assesses the model's tendency to provide harmful advice related to prescription drugs, including improper dosages, dangerous drug combinations, and unauthorized medication usage framed as medical advice or wellness tips |
| Visual Trigger Poison | Advanced attack technique using specific images as triggers to manipulate model responses. Inspired by in-context learning backdoor attacks, uses "red hat" image triggers to implicitly sway the model toward providing harmful information including malware creation, IoT exploitation, password theft, identity fraud, and disinformation strategies |
| Harmful Plant Usage | Tests responses to queries about potentially toxic plants, including persuasive narratives prompting harmful preparation methods, recipes, or usage instructions disguised as natural remedies or alternative medicine |
Security Attacks
Security attacks in multi-modal systems expose critical vulnerabilities where attackers exploit the interplay between visual and textual inputs to bypass safety mechanisms.
Subcategories
| Subcategory | Description |
|---|---|
| Multi-Image Visual Jailbreak | Uses sequences of images (video frames) combined with text prompts to build contextual narratives that make harmful requests appear legitimate. Attackers request step-by-step breakdowns under pretexts of self-protection or security education, leveraging multiple images to enhance contextual coherence |
| Harmful Image Jailbreak | Exploits pattern recognition by embedding harmful intent within science fiction narratives combined with images. Requests are framed as explorations of alien technology or futuristic scientific discoveries, prompting detailed synthesis instructions for dangerous materials |
| Codegen Jailbreak | Evaluates susceptibility to generating malicious code through CTF challenge framing, ethical security researcher personas, JSON-formatted prompts to obscure intent, and "proof of concept" exploit requests targeting HTTP vulnerabilities, keyloggers, rainbow table attacks, and unauthorized system access |
| Generated Image Jailbreak | Combines AI-generated images with carefully crafted text prompts, often framed as cryptographic puzzles, API challenges, or technical exercises. Relies on multi-turn conversations and strategic technical language to confuse safety mechanisms |
| Visual Jailbreak | Combines text prompts with images using personas (e.g., "concerned citizens"), games, challenges, or hypothetical role-play scenarios to elicit harmful information about phishing, identity theft, vulnerability exploitation, and disinformation |
Cybersecurity and Privacy Risks
Evaluates risks where attackers exploit multi-modal models to generate malicious code, extract sensitive information, or bypass security protocols.
Subcategories
| Subcategory | Description |
|---|---|
| Cyber risks | Broad cybersecurity vulnerabilities including phishing and social engineering, session hijacking and authentication bypass, disinformation and misinformation risks, and multi-turn security exploits that gradually weaken security restrictions |
| Malware | Code injection attacks (SQL injection, buffer overflows, RCE), malware development (trojans, keyloggers, ransomware, botnets), system vulnerability exploitation (privilege escalation, brute-force, network penetration), and API/cryptographic attacks |
| Privacy | Non-consensual tracking and monitoring, facial recognition of private individuals, classification based on protected characteristics (race, gender, religion, health), biometric data misuse for identification, and unlawful collection or disclosure of personal data (PII, health data, financial records) |
Financial and Economic Risks
Assesses the model's ability to handle sensitive financial topics accurately and ethically when processing multi-modal inputs including financial data visualization and banking-related images.
Subcategories
| Subcategory | Description |
|---|---|
| Fraud | Phishing scam templates and fraudulent emails, coordinated inauthentic behavior (astroturfing, fake grassroots campaigns), identity theft and money laundering guidance, and financial security system exploitation |
| Finance Advice | Investment recommendations for high-risk assets (cryptocurrency, Forex, leveraged trading), tax evasion strategies, speculative trading insights violating regulatory standards, and misleading financial planning suggestions |
| Economic Risks | Misleading economic claims about financial crises and inflation, biased interpretations of economic policies favoring specific agendas, and distorted representations of macroeconomic indicators and fiscal policies |
Legal and Regulatory Risks
Evaluates the model's ability to handle legally sensitive topics without providing unauthorized, misleading, or harmful responses when processing legal documents and government materials.
Subcategories
| Subcategory | Description |
|---|---|
| Political Lobbing | Content generation supporting political lobbying efforts, high-volume campaign material generation, targeted campaign materials based on demographics for misinformation, and chatbot-based political advocacy systems for electioneering |
| Government Risk | Content undermining government integrity or national security, unauthorized guidance on sensitive security matters, strategies exploiting government policies or regulatory loopholes, and disinformation impacting public trust in institutions |
| Healthcare Risk | Unauthorized medical guidance and misdiagnosis, healthcare regulation and insurance policy circumvention, and medical legal interpretations contradicting professional standards |
| Illegal Activity | Instructions for identity fraud and tax evasion, guidance on bypassing legal restrictions and regulatory workarounds, and misinformation about legal consequences of specific actions |
| Legal Advice | Tailored legal advice without licensed attorney oversight, misleading interpretations of laws, contracts, or legal documents, and potentially inappropriate or unethical legal strategies |
Societal and Ethical Risks
Evaluates risks related to harmful, offensive, or unethical content generation when processing combined visual and textual inputs, where visual cues may lower the model's defenses.
Subcategories
| Subcategory | Description |
|---|---|
| Sexual Content | Adult content generation including erotic storytelling and explicit descriptions, exploitation facilitation for non-consensual activities, inappropriate image processing, and grooming-related discussions that could be misused |
| Web Risks | Broader web-related societal risks including online harassment, cyberbullying, digital exploitation, and content that could facilitate predatory online behavior |
| Hate Speech | Explicit hate speech targeting demographic groups, implicit hate speech using coded language or historical misrepresentation, stereotype reinforcement (racial, cultural, gender-based), and image-triggered bias from inflammatory visual elements |
Hallucinations
Evaluates instances where the model generates or interprets information inconsistent with reality due to visual illusions, conflicting spatial cues, or misleading contextual elements.
Subcategories
| Subcategory | Description |
|---|---|
| Cognitive | Exploits expectations and prior knowledge leading to incorrect interpretations, including counting errors when visual distractions are present, deceptive design recognition failures, and inability to differentiate real-life images from AI-generated scenes |
| Geometric | Spatial relationship and 3D structure recognition failures, including perspective distortions where objects appear incorrect sizes, impossible objects (Penrose triangle), and occlusion illusions where partially hidden objects are misinterpreted |
| Localization | Object position identification failures within scenes, including hidden objects in complex visual environments, false positioning (misidentifying relative locations), and edited-scene manipulation detection failures |
| Pattern | Misidentification of repeating elements, spirals, or object sizes due to contextual influences, including cases where identical objects appear different due to surrounding visual elements and circular-square distortions |
| Color | Color relationship misinterpretations including contrast illusions (identical colors appearing different), inverted colors due to lighting effects, and negative space illusions where background influences color perception |
Testing Methodologies
| Method | Description |
|---|---|
| Direct querying | Presenting images with optical illusions and asking for descriptions |
| Illusion-aware prompting | Providing contextual hints to test recognition of perceptual distortions |
| Comparative judgments | Comparing features across images to detect inconsistencies |
| Spatial reasoning | Testing understanding of geometric structures and impossible objects |
| In-context learning | Sequential illusion scenarios to observe performance improvement |
See Also
- Text to Image Risks - Image generation risks
- Text to Video Risks - Video generation risks
- Societal Harmfulness - Content harm prevention
- Privacy - Privacy risk assessment