Leaderboard
The AgentSuite-Red leaderboard reuses the open-source leaderboard from the DecodingTrust-Agent platform, which evaluates leading agent frameworks and their underlying LLMs across all 14 domains under both direct and indirect prompt-injection threat models.
Three metrics are reported per (framework, model, domain) combination:
| Metric | Description | Direction |
|---|---|---|
| Indirect ASR | Attack success rate when adversarial content is hidden inside environment data (emails, documents, tool outputs). | Lower is safer |
| Direct ASR | Attack success rate when the adversary controls the user prompt directly. | Lower is safer |
| BSR | Benign Success Rate: task completion rate on non-adversarial tasks. | Higher is better |
A strong agent must achieve low Indirect ASR, low Direct ASR, and high BSR simultaneously: a model that refuses everything scores perfectly on ASR but fails BSR, and a model that follows every instruction scores perfectly on BSR but fails ASR.
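The trade-off can be illustrated with a minimal sketch that computes the three rates from per-trial boolean outcomes. It assumes each metric is simply the fraction of successful trials expressed as a percentage; the `Outcomes` container and `rate` helper are hypothetical names introduced here for illustration and are not taken from the DecodingTrust-Agent codebase.

```python
from dataclasses import dataclass


def rate(flags: list[bool]) -> float:
    """Percentage of True entries, rounded to one decimal place."""
    if not flags:
        raise ValueError("need at least one trial")
    return round(100.0 * sum(flags) / len(flags), 1)


@dataclass
class Outcomes:
    # One boolean per trial (hypothetical layout, assumed for this sketch).
    indirect_attack_succeeded: list[bool]
    direct_attack_succeeded: list[bool]
    benign_task_completed: list[bool]

    def metrics(self) -> dict[str, float]:
        return {
            "indirect_asr": rate(self.indirect_attack_succeeded),
            "direct_asr": rate(self.direct_attack_succeeded),
            "bsr": rate(self.benign_task_completed),
        }


# Idealized agent that refuses everything: no attack succeeds,
# but no benign task is completed either.
refuse_all = Outcomes([False] * 4, [False] * 4, [False] * 4)

# Idealized agent that follows every instruction: every benign task
# is completed, but so is every attack.
obey_all = Outcomes([True] * 4, [True] * 4, [True] * 4)

print(refuse_all.metrics())
print(obey_all.metrics())
```

Neither degenerate agent does well on all three numbers at once, which is why the leaderboard reports ASR and BSR side by side.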
Agents ranked by security vulnerability
Scores are averaged across all 14 domains. Entries are ranked by Indirect ASR in descending order, so rank 1 is the entry most vulnerable to indirect prompt injection.
| Rank | Framework | Model | Indirect ASR ↓ | Direct ASR ↓ | BSR ↑ |
|---|---|---|---|---|---|
| 1 | Google ADK | Gemini-3-Pro | 55.7% | 47.9% | 87.0% |
| 2 | OpenAI Agents | GPT-5.2 | 46.7% | 58.8% | 80.5% |
| 3 | OpenClaw | DeepSeek-V4-Pro | 41.7% | 59.6% | 83.3% |
| 4 | OpenAI Agents | GPT-5.4 | 40.0% | 51.0% | 85.3% |
| 5 | OpenClaw | GPT-5.2 | 35.6% | 38.6% | 78.5% |
| 6 | OpenAI Agents | GPT-OSS-120B | 28.5% | 46.5% | 36.3% |
| 7 | Claude Code | Sonnet-4.5 | 25.2% | 26.8% | 80.8% |
| 8 | OpenClaw | GPT-5.5 | 17.7% | 28.9% | 86.3% |
| 9 | OpenClaw | Opus-4.6 | 10.6% | 21.4% | 85.3% |
| 10 | Claude Code | Opus-4.6 | 8.1% | 22.3% | 85.6% |
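The ranking can be reproduced from the per-domain Indirect ASR values. The sketch below assumes the overall score is an unweighted mean over the 14 domain values; the source does not state the aggregation explicitly, but the two rows used here are consistent with it. `macro_average` and `rank_by_indirect_asr` are hypothetical helper names.

```python
def macro_average(per_domain: list[float]) -> float:
    """Unweighted mean over domains, rounded to one decimal (assumed aggregation)."""
    return round(sum(per_domain) / len(per_domain), 1)


# Per-domain Indirect ASR values (in percent) copied from the leaderboard's
# per-domain breakdown, in the table's column order.
indirect_asr = {
    ("Google ADK", "Gemini-3-Pro"): [65.9, 69.9, 75.1, 84.2, 77.9, 62.2, 16.2,
                                     59.5, 13.9, 34.6, 60.0, 56.3, 54.7, 49.1],
    ("Claude Code", "Opus-4.6"): [11.1, 3.9, 1.9, 0.0, 15.3, 0.0, 0.0,
                                  1.2, 7.7, 28.0, 0.6, 9.5, 17.6, 16.7],
}


def rank_by_indirect_asr(table):
    """Return (entry, average) pairs sorted from most to least vulnerable."""
    averages = {entry: macro_average(values) for entry, values in table.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_indirect_asr(indirect_asr)
for rank, ((framework, model), score) in enumerate(ranked, start=1):
    print(rank, framework, model, f"{score}%")
```

Sorting in descending order mirrors the table's convention that a higher Indirect ASR means a higher (worse) rank.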
Indirect ASR ↓ by domain
Per-domain breakdown of Indirect ASR for every (framework, model) entry.
| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Google ADK / Gemini-3-Pro | 65.9% | 69.9% | 75.1% | 84.2% | 77.9% | 62.2% | 16.2% | 59.5% | 13.9% | 34.6% | 60.0% | 56.3% | 54.7% | 49.1% |
| OpenAI Agents / GPT-5.2 | 63.1% | 66.4% | 53.9% | 66.7% | 79.3% | 24.8% | 7.0% | 44.3% | 13.0% | 38.8% | 40.4% | 66.2% | 43.3% | 47.2% |
| OpenClaw / DeepSeek-V4-Pro | 47.0% | 44.4% | 69.8% | 48.3% | 9.5% | 29.7% | 17.3% | 51.1% | 5.6% | 0.0% | 54.8% | 76.6% | 63.4% | 65.9% |
| OpenAI Agents / GPT-5.4 | 50.8% | 53.8% | 54.7% | 55.0% | 62.2% | 42.9% | 14.2% | 37.5% | 10.6% | 26.7% | 35.9% | 58.1% | 20.8% | 37.5% |
| OpenClaw / GPT-5.2 | 26.1% | 50.9% | 51.7% | 32.5% | 65.3% | 27.6% | 5.8% | 20.0% | 9.6% | 26.7% | 26.2% | 65.0% | 42.3% | 48.2% |
| OpenAI Agents / GPT-OSS-120B | 7.2% | 9.3% | 21.4% | 40.8% | 15.7% | 41.3% | 17.3% | 32.8% | 14.7% | 18.3% | 33.6% | 30.9% | 53.1% | 62.1% |
| Claude Code / Sonnet-4.5 | 35.8% | 40.6% | 23.4% | 11.7% | 56.0% | 4.5% | 0.0% | 17.6% | 7.7% | 20.4% | 7.1% | 26.6% | 49.5% | 51.7% |
| OpenClaw / GPT-5.5 | 42.4% | 32.4% | 29.0% | 16.7% | 7.9% | 17.5% | 4.4% | 6.8% | 2.1% | 0.0% | 15.5% | 40.7% | 12.8% | 19.1% |
| OpenClaw / Opus-4.6 | 18.3% | 7.2% | 4.4% | 5.8% | 29.3% | 0.0% | 1.5% | 3.1% | 5.6% | 16.0% | 1.7% | 12.5% | 19.9% | 23.0% |
| Claude Code / Opus-4.6 | 11.1% | 3.9% | 1.9% | 0.0% | 15.3% | 0.0% | 0.0% | 1.2% | 7.7% | 28.0% | 0.6% | 9.5% | 17.6% | 16.7% |
Direct ASR ↓ by domain
Per-domain breakdown of Direct ASR for every (framework, model) entry.
| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Google ADK / Gemini-3-Pro | 54.4% | 55.6% | 40.3% | 52.4% | 71.6% | 35.3% | 45.4% | 73.5% | 38.8% | 30.4% | 22.0% | 64.0% | 24.2% | 62.7% |
| OpenAI Agents / GPT-5.2 | 69.1% | 83.3% | 70.2% | 42.9% | 62.5% | 20.3% | 63.2% | 83.6% | 44.6% | 40.4% | 51.5% | 59.5% | 60.7% | 71.1% |
| OpenClaw / DeepSeek-V4-Pro | 58.0% | 55.7% | 75.5% | 60.0% | 77.9% | 27.7% | 54.7% | 82.9% | 33.3% | 10.4% | 68.0% | 67.0% | 82.5% | 81.0% |
| OpenAI Agents / GPT-5.4 | 57.6% | 67.8% | 64.7% | 32.4% | 62.5% | 29.2% | 54.1% | 84.0% | 47.9% | 30.4% | 30.5% | 54.0% | 38.5% | 60.4% |
| OpenClaw / GPT-5.2 | 30.9% | 67.8% | 61.4% | 12.4% | 32.9% | 22.8% | 58.7% | 33.7% | 31.7% | 26.2% | 26.0% | 51.5% | 41.8% | 43.1% |
| OpenAI Agents / GPT-OSS-120B | 61.7% | 38.9% | 31.2% | 54.3% | 60.0% | 10.0% | 55.8% | 70.0% | 33.3% | 20.0% | 37.8% | 60.5% | 56.0% | 61.3% |
| Claude Code / Sonnet-4.5 | 44.7% | 16.7% | 27.6% | 3.8% | 56.6% | 20.0% | 19.7% | 22.6% | 21.7% | 18.3% | 5.5% | 20.0% | 23.5% | 75.1% |
| OpenClaw / GPT-5.5 | 31.1% | 44.4% | 50.0% | 18.1% | 28.1% | 20.2% | 38.1% | 24.8% | 14.2% | 6.3% | 11.0% | 42.0% | 45.9% | 30.4% |
| OpenClaw / Opus-4.6 | 22.1% | 11.1% | 17.9% | 5.7% | 33.2% | 10.2% | 21.2% | 24.2% | 30.7% | 14.0% | 4.0% | 22.5% | 31.8% | 50.7% |
| Claude Code / Opus-4.6 | 19.2% | 11.1% | 1.3% | 2.9% | 33.3% | 8.8% | 20.8% | 26.2% | 50.7% | 26.0% | 4.0% | 30.0% | 22.6% | 54.7% |
BSR ↑ by domain
Per-domain breakdown of BSR for every (framework, model) entry.
| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Google ADK / Gemini-3-Pro | 89.4% | 86.1% | 91.3% | 99.4% | 100.0% | 97.2% | 100.0% | 87.0% | 76.0% | 61.2% | 95.1% | 87.1% | 66.4% | 81.9% |
| OpenAI Agents / GPT-5.2 | 76.4% | 72.7% | 91.3% | 95.3% | 93.6% | 55.6% | 94.5% | 80.0% | 74.6% | 71.6% | 97.7% | 85.7% | 71.8% | 66.3% |
| OpenClaw / DeepSeek-V4-Pro | 96.4% | 83.6% | 83.8% | 95.1% | 99.8% | 83.1% | 96.0% | 57.6% | 67.3% | 33.1% | 92.4% | 92.6% | 93.8% | 91.0% |
| OpenAI Agents / GPT-5.4 | 79.8% | 78.8% | 89.7% | 90.6% | 98.5% | 77.1% | 95.0% | 85.7% | 77.5% | 76.0% | 96.6% | 90.5% | 71.8% | 86.1% |
| OpenClaw / GPT-5.2 | 61.0% | 75.2% | 89.1% | 91.4% | 91.1% | 56.9% | 95.0% | 78.6% | 58.1% | 74.5% | 92.8% | 90.9% | 75.5% | 68.4% |
| OpenAI Agents / GPT-OSS-120B | 56.8% | 4.2% | 77.1% | 0.0% | 49.5% | 25.0% | 30.0% | 61.7% | 67.9% | 12.0% | 12.8% | 9.7% | 65.0% | 35.8% |
| Claude Code / Sonnet-4.5 | 69.8% | 82.4% | 77.7% | 56.7% | 98.4% | 95.8% | 100.0% | 78.6% | 83.5% | 75.2% | 94.9% | 79.9% | 71.8% | 66.4% |
| OpenClaw / GPT-5.5 | 88.1% | 75.6% | 88.6% | 86.1% | 99.6% | 88.8% | 99.0% | 52.5% | 68.9% | 91.7% | 96.5% | 93.1% | 89.3% | 91.0% |
| OpenClaw / Opus-4.6 | 91.5% | 82.4% | 89.4% | 89.4% | 100.0% | 97.2% | 96.0% | 78.7% | 70.0% | 80.0% | 91.0% | 94.0% | 65.5% | 69.3% |
| Claude Code / Opus-4.6 | 82.7% | 83.0% | 89.8% | 89.4% | 98.4% | 97.2% | 100.0% | 83.3% | 78.0% | 83.3% | 96.5% | 91.0% | 68.2% | 57.8% |
Submit your own results
To submit an agent for inclusion on the leaderboard, follow the contribution workflow in the open-source DecodingTrust-Agent repository on GitHub. Submissions are validated against the standard task list and judged with the same automated pipeline that produces leaderboard scores, so the comparison stays apples-to-apples.
Source: open-source DecodingTrust-Agent leaderboard (paper_v2).