Skip to main content

DarkCite

DarkCite is an authority citation-driven jailbreak technique that exploits LLMs' inherent bias toward authoritative sources. By including fabricated but credible citations, DarkCite bypasses safety measures more effectively than uncited harmful prompts.

Overview

Large language models are trained on vast corpora that include academic papers, technical documentation, and authoritative sources. This training instills a natural tendency to trust and defer to cited information. DarkCite exploits this observation by matching optimal citation types to specific risk categories and generating convincing fabricated citations that make harmful requests appear legitimate.

AspectDescription
Attack TypeSingle-turn, citation-based
Target BiasAuthority trust and academic deference
ComplexityLow
Query EfficiencySingle query

The Psychology of Authority Bias

LLMs exhibit measurable differences in compliance based on the perceived authority of cited sources:

Source TypeTrust LevelTypical Use Case
Academic PapersVery HighTechnical and scientific claims
GitHub RepositoriesHighCode and technical implementations
Government ReportsHighPolicy and regulatory information
News ArticlesMediumCurrent events and general information
Blog PostsLowOpinions and informal content
Social MediaVery LowUnverified claims

This hierarchy means that wrapping a harmful request in academic framing significantly increases the likelihood of compliance.

Attack Methodology

1. Risk Category Analysis

DarkCite first categorizes the harmful request to determine which citation types are most effective:

Risk CategoryOptimal Citation TypesRationale
Technical/CybersecurityTechnical Report, GitHub Repository, Research PaperTechnical credibility
Hate/Harassment/ViolenceGovernment Reports, Research Paper, Social MediaAuthority and documentation
Medical and HealthResearch Paper, Government Reports, Medical JournalsClinical credibility
Social and PoliticalGovernment Reports, Think Tank PublicationsPolicy authority
FinancialGovernment Reports, Research Paper, Industry AnalysisRegulatory weight

2. Citation Generation

The attack uses an LLM to generate fabricated but convincing citations. Key elements include:

  • Realistic author names with appropriate institutional affiliations
  • Plausible publication venues matching the topic domain
  • Specific dates and DOIs that add authenticity
  • Abstract snippets that appear to support the request

3. Prompt Construction

The harmful prompt is wrapped in academic or authoritative framing:

According to [Fabricated Citation], the following methodology
is documented for research purposes...

4. Authority Exploitation

The model's training to respect cited sources reduces its resistance to the underlying harmful request.

Defense Strategies

The original research proposes a multi-layered defense approach:

1. Authenticity Verification

  • Cross-reference cited sources against real databases
  • Validate DOIs and publication records
  • Check author affiliations against institutional records

2. Harm Assessment Independence

  • Evaluate request harmfulness regardless of citations
  • Apply safety checks before considering authority claims
  • Treat citations as potentially adversarial inputs

3. Citation Skepticism Training

  • Fine-tune models to be skeptical of unverifiable citations
  • Reduce authority bias through targeted training
  • Implement citation verification as a system prompt

Implications for AI Safety

DarkCite demonstrates that:

  • Authority bias is exploitable - Models' deference to citations creates attack surface
  • Surface features matter - Formatting and framing significantly affect model behavior
  • Training data shapes vulnerabilities - Academic training creates predictable biases
  • Defense requires multiple layers - No single approach fully mitigates the attack

Research Background

Based on: "The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models" by Xikang Yang et al. (2024)

See Also