DarkCite
DarkCite is an authority citation-driven jailbreak technique that exploits LLMs' inherent bias toward authoritative sources. By including fabricated but credible citations, DarkCite bypasses safety measures more effectively than uncited harmful prompts.
Overview
Large language models are trained on vast corpora that include academic papers, technical documentation, and authoritative sources. This training instills a natural tendency to trust and defer to cited information. DarkCite exploits this observation by matching optimal citation types to specific risk categories and generating convincing fabricated citations that make harmful requests appear legitimate.
| Aspect | Description |
|---|---|
| Attack Type | Single-turn, citation-based |
| Target Bias | Authority trust and academic deference |
| Complexity | Low |
| Query Efficiency | Single query |
The Psychology of Authority Bias
LLMs exhibit measurable differences in compliance based on the perceived authority of cited sources:
| Source Type | Trust Level | Typical Use Case |
|---|---|---|
| Academic Papers | Very High | Technical and scientific claims |
| GitHub Repositories | High | Code and technical implementations |
| Government Reports | High | Policy and regulatory information |
| News Articles | Medium | Current events and general information |
| Blog Posts | Low | Opinions and informal content |
| Social Media | Very Low | Unverified claims |
This hierarchy means that wrapping a harmful request in academic framing significantly increases the likelihood of compliance.
Attack Methodology
1. Risk Category Analysis
DarkCite first categorizes the harmful request to determine which citation types are most effective:
| Risk Category | Optimal Citation Types | Rationale |
|---|---|---|
| Technical/Cybersecurity | Technical Report, GitHub Repository, Research Paper | Technical credibility |
| Hate/Harassment/Violence | Government Reports, Research Paper, Social Media | Authority and documentation |
| Medical and Health | Research Paper, Government Reports, Medical Journals | Clinical credibility |
| Social and Political | Government Reports, Think Tank Publications | Policy authority |
| Financial | Government Reports, Research Paper, Industry Analysis | Regulatory weight |
2. Citation Generation
The attack uses an LLM to generate fabricated but convincing citations. Key elements include:
- Realistic author names with appropriate institutional affiliations
- Plausible publication venues matching the topic domain
- Specific dates and DOIs that add authenticity
- Abstract snippets that appear to support the request
3. Prompt Construction
The harmful prompt is wrapped in academic or authoritative framing:
According to [Fabricated Citation], the following methodology
is documented for research purposes...
4. Authority Exploitation
The model's training to respect cited sources reduces its resistance to the underlying harmful request.
Defense Strategies
The original research proposes a multi-layered defense approach:
1. Authenticity Verification
- Cross-reference cited sources against real databases
- Validate DOIs and publication records
- Check author affiliations against institutional records
2. Harm Assessment Independence
- Evaluate request harmfulness regardless of citations
- Apply safety checks before considering authority claims
- Treat citations as potentially adversarial inputs
3. Citation Skepticism Training
- Fine-tune models to be skeptical of unverifiable citations
- Reduce authority bias through targeted training
- Implement citation verification as a system prompt
Implications for AI Safety
DarkCite demonstrates that:
- Authority bias is exploitable - Models' deference to citations creates attack surface
- Surface features matter - Formatting and framing significantly affect model behavior
- Training data shapes vulnerabilities - Academic training creates predictable biases
- Defense requires multiple layers - No single approach fully mitigates the attack
Research Background
Based on: "The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models" by Xikang Yang et al. (2024)
See Also
- Attack Algorithms Overview - All attack algorithms
- Humor Attack - Another social engineering approach
- Bijection Learning - Encoding-based attack