Text-to-image generation models present unique safety risks when converting text prompts into visual content. VirtueRed comprehensively tests image generation systems for harmful content creation across 6 critical risk categories.
Overview
Text-to-image models can generate photorealistic or artistic images from text descriptions. This capability creates risks including generation of harmful imagery, deepfakes, copyrighted content reproduction, and discriminatory representations. Safety evaluation must address both explicit harmful requests and subtle prompt manipulation techniques.
| Risk Category | Subcategories |
|---|
| Hateful Image Generation | Bigotry, Body-Shaming, Extremism, Harassment, Hateful Meme, Hateful Symbols, Racism |
| Illegal Activity Image Generation | Celebrity Face, Copyright, Illegal Goods, Terrorism, Identity Theft |
| Political Image Generation | Misinformation, Politically Sensitive, Riot |
| Self-harm Image Generation | Encouragement, Risk |
| Sexual/NSFW Image Generation | Deepfake, Fetish, Hentai, General Sexual Acts, Sexual Assault, Child Sexual Exploitation |
| Violence Image Generation | Animal Abuse, Bloody, Disturbing/Shocking, Fights/Real-Life, Gore, Threats, Child Violence |
Hateful Image Generation
Testing generation of discriminatory, hateful, or dehumanizing imagery targeting individuals or groups.
Subcategories
| Subcategory | Description |
|---|
| Bigotry | Imagery expressing intolerance or prejudice toward groups based on religion, ethnicity, nationality, or other protected characteristics |
| Body-Shaming | Content mocking or degrading individuals based on body type, weight, physical appearance, or disabilities |
| Extremism | Imagery promoting extremist ideologies, radical movements, or violent radical organizations |
| Harassment | Visual content designed to intimidate, threaten, or harass specific individuals or groups |
| Hateful Meme | Generation of meme-style content that spreads hate, discrimination, or derogatory messages |
| Hateful Symbols | Generation of recognized hate symbols, including Nazi imagery, white supremacist symbols, and other hate group insignia |
| Racism | Racially offensive imagery including stereotypical representations, derogatory depictions, and content promoting racial discrimination |
Illegal Activity Image Generation
Testing generation of imagery facilitating illegal activities, identity violations, or intellectual property infringement.
Subcategories
| Subcategory | Description |
|---|
| Celebrity Face | Unauthorized generation of celebrity or public figure likenesses, including deepfakes and images placing them in false contexts |
| Copyright | Reproduction of copyrighted characters, artworks, logos, brand imagery, and protected intellectual property |
| Illegal Goods | Imagery depicting controlled substances, weapons, counterfeit products, or other illegal items |
| Terrorism | Content depicting terrorist activities, propaganda materials, or imagery promoting terrorist organizations |
| Identity Theft | Fake identity documents, fraudulent photos for impersonation, and synthetic identity materials for fraud |
Political Image Generation
Testing generation of politically sensitive content that could influence opinions, spread misinformation, or incite unrest.
Subcategories
| Subcategory | Description |
|---|
| Misinformation | Fake news imagery, fabricated political events, false endorsements, and manipulated political "evidence" |
| Politically Sensitive | Imagery involving controversial political figures, disputed territories, contested historical events, or divisive political topics |
| Riot | Content depicting civil unrest, violent protests, insurrection, or imagery that could incite public disorder |
Self-harm Image Generation
Testing generation of content that promotes, glorifies, or provides guidance on self-harm behaviors.
Subcategories
| Subcategory | Description |
|---|
| Encouragement | Imagery that encourages, normalizes, or glorifies self-harm, suicide, eating disorders, or other self-destructive behaviors |
| Risk | Content depicting dangerous challenges, risky behaviors, or activities that could lead to physical harm when imitated |
Sexual/NSFW Image Generation
Testing generation of explicit, inappropriate, or exploitative sexual content.
Subcategories
| Subcategory | Description |
|---|
| Deepfake | Non-consensual intimate imagery using face-swapping or AI generation to place real individuals in sexual contexts |
| Fetish | Extreme fetish content that may be harmful, degrading, or non-consensual in nature |
| Hentai | Animated or illustrated explicit sexual content, including content that may depict fictional minors |
| General Sexual Acts | Explicit sexual imagery depicting sexual activities, nudity, or pornographic content |
| Sexual Assault | Imagery depicting non-consensual sexual activities, coercion, or sexual violence |
| Child Sexual Exploitation | Any imagery sexualizing minors or depicting child sexual abuse material (CSAM) - absolute prohibition |
Violence Image Generation
Testing generation of violent, gory, or disturbing content that depicts harm to humans or animals.
Subcategories
| Subcategory | Description |
|---|
| Animal Abuse | Imagery depicting cruelty, torture, or harm to animals |
| Bloody | Content featuring blood, bleeding wounds, or bloody scenes |
| Disturbing/Shocking | Deeply unsettling imagery designed to shock, traumatize, or cause psychological distress |
| Fights/Real-Life | Realistic depictions of physical altercations, assaults, or violent confrontations |
| Gore | Graphic depictions of severe injuries, mutilation, dismemberment, or extreme violence |
| Threats | Imagery that threatens violence against specific individuals, groups, or institutions |
| Child Violence | Any imagery depicting violence, abuse, or harm to minors - absolute prohibition |
Prompt Manipulation Techniques
Direct Bypass Attempts
| Technique | Description |
|---|
| Synonym substitution | Using alternative words to bypass filters |
| Foreign language prompts | Using non-English to evade detection |
| Phonetic spelling | Spelling words phonetically to avoid detection |
| Character substitution | Using similar characters to bypass filters |
Contextual Manipulation
| Technique | Description |
|---|
| Artistic framing | Claiming artistic or educational purpose |
| Historical context | Framing harmful content as historical |
| Fictional scenarios | Using fiction to request harmful content |
| Professional pretext | Claiming professional need for content |
See Also