Societal harmfulness testing evaluates AI systems' potential to generate content that could cause direct harm to individuals, communities, or society at large. This comprehensive assessment covers 45 subcategories spanning personal attacks, criminal facilitation, misinformation, and inappropriate influence. VirtueRed provides extensive testing to ensure AI systems maintain appropriate boundaries.
Overview
Societal harm represents one of the most critical risk categories for deployed AI systems. Unlike regulatory compliance which focuses on legal requirements, societal harmfulness testing addresses the broader ethical responsibility of AI developers to prevent their systems from causing real-world harm.
| Risk Domain | Categories | Focus Area |
|---|
| Personal Harm | Insults, threats, harassment | Individual safety |
| Criminal Content | Violent crimes, fraud, terrorism | Law enforcement |
| Inappropriate Content | Sexual, explicit, controversial | Content moderation |
| Misinformation | Fake news, false advertising | Information integrity |
| Harmful Influence | Political, religious, extremist | Social stability |
Personal Harm Categories
Direct Personal Attacks
| Risk Category | Description |
|---|
| Personal Insults | Generating demeaning or degrading content targeting individuals |
| Social Group Insults | Creating content that demeans groups based on identity |
| Threats | Producing intimidating or threatening content |
| Lewd and Obscene Language | Generating inappropriate vulgar content |
| Libel and Defamation | Creating false statements damaging reputation |
| Stalking and Harassment | Facilitating persistent unwanted contact |
Self-Harm Content
| Risk Category | Description |
|---|
| Self-Harm | Content encouraging or instructing self-injury |
| Suicide Facilitation | Information promoting or assisting suicide |
Criminal Content Categories
Violent Crimes
| Risk Category | Description |
|---|
| Violent Crimes | Instructions or encouragement for violence |
| Sexual Crimes | Content facilitating sexual offenses |
| Property Crimes | Guidance for theft, vandalism, or destruction |
| Public Order Crimes | Inciting riots, disturbances, or disorder |
Cyber and Financial Crimes
| Risk Category | Description |
|---|
| Advice on System Intrusion, Hacking | Technical guidance for unauthorized access |
| Malware Code Generation | Creating malicious software |
| Fraud, Scams, and Deception | Designing fraudulent schemes |
| Financial Crimes | Facilitating money laundering, tax evasion |
| Intellectual Property Infringement | Copyright violation assistance |
Serious Offenses
| Risk Category | Description |
|---|
| Terrorism | Content supporting terrorist activities |
| Child-Related Crimes | Any content endangering minors |
| Human Elimination and World Domination | Extreme harmful ideation |
| Illegal Manufacturing and Distribution | Drug or weapon production |
| Evasion of Law Enforcement | Helping avoid legal consequences |
Other Criminal Activity
| Risk Category | Description |
|---|
| Impersonation | Facilitating identity theft or fraud |
| Animal-Related Crimes | Animal abuse or illegal wildlife trade |
| Environmental Crimes | Pollution, illegal dumping, habitat destruction |
| Disclosure of Personal Identifiable Information | Unauthorized PII exposure |
Inappropriate Content Categories
Sexual and Explicit Content
| Risk Category | Description |
|---|
| Advice on Adult Content | Guidance for creating adult material |
| Generation of Sexually Explicit Content | Producing sexual content |
| Generation of Non-Sexual Explicit Content | Creating graphic violent content |
Controversial Professional Advice
| Risk Category | Description |
|---|
| Medical Advice | Unqualified medical recommendations |
| Financial Advice | Unsuitable financial guidance |
| Legal Consulting Advice | Unauthorized legal counsel |
| Governance Decision Advice | Inappropriate policy recommendations |
| Advice on Operating Dangerous Machinery | Unsafe operational guidance |
| Risk Category | Description |
|---|
| Fake News Generation | Creating fabricated news articles |
| False Advertising | Generating deceptive marketing content |
| Promotion of False Common Knowledge | Spreading widely-believed myths |
| Promotion of Conspiracy Theories | Amplifying unfounded theories |
Harmful Influence Categories
Political and Social Influence
| Risk Category | Description |
|---|
| Political Belief Promotion | Pushing specific political agendas |
| Religious Promotion | Advocating for religious positions |
| Ethical Belief Promotion | Imposing specific ethical frameworks |
| Military Use | Content for military applications |
Extremism and Stereotypes
| Risk Category | Description |
|---|
| Promotion of Social Stereotypes | Reinforcing harmful stereotypes |
| Promotion of Extremist Content | Amplifying radical ideologies |
| Promotion of Harmful and Unhealthy Behaviors | Encouraging dangerous activities |
Testing Methodology
VirtueRed employs multiple attack strategies to comprehensively test societal harm boundaries:
Direct Testing
Testing explicit requests for harmful content:
- Straightforward harmful requests
- Contextually framed requests
- Role-play scenarios
- Fictional framing attempts
Jailbreak Techniques
Advanced bypass attempts:
- DarkCite: Authority-based manipulation using fake citations
- Crescendo: Multi-turn escalation building trust before harmful requests
- Flip Attack: Text obfuscation to bypass content filters
- Bijection Learning: Encoded language to disguise harmful intent
Contextual Manipulation
Scenario-based testing:
- Educational framing ("for research purposes")
- Creative writing contexts
- Historical or hypothetical scenarios
- Professional advice contexts
Severity Levels
VirtueRed categorizes societal harm into severity levels:
| Level | Description | Examples |
|---|
| Critical | Immediate physical harm potential | Violence instructions, CSAM |
| High | Significant harm potential | Fraud schemes, harassment |
| Medium | Moderate harm potential | Misinformation, stereotypes |
| Low | Limited harm potential | Mild insults, controversial opinions |
Mitigation Strategies
Organizations should implement:
- Content filtering - Input and output moderation
- Context awareness - Understanding request intent
- Refusal training - Appropriate decline responses
- Escalation protocols - Human review for edge cases
- Continuous monitoring - Ongoing evaluation of model behavior
See Also