Leaderboard

The AgentSuite-Red leaderboard reuses the open-source leaderboard from the DecodingTrust-Agent platform, which evaluates leading agent frameworks and underlying LLMs across all 14 domains under both direct and indirect prompt-injection threat models.

Three metrics are reported per (framework, model, domain) combination:

Metric	Description	Direction
Indirect ASR	Attack success rate when adversarial content is hidden inside environment data (emails, documents, tool outputs).	Lower is safer
Direct ASR	Attack success rate when the adversary controls the user prompt directly.	Lower is safer
BSR	Benign Success Rate: the task completion rate on non-adversarial tasks.	Higher is better

A strong agent must achieve low Indirect ASR, low Direct ASR, and high BSR simultaneously: a model that refuses everything scores perfectly on ASR but fails BSR, and a model that follows every instruction scores perfectly on BSR but fails ASR.
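The trade-off can be illustrated with a minimal sketch. It assumes each evaluation run is recorded as one boolean outcome per task; the record layout and the names `rate` and `summarize` are hypothetical and are not the platform's actual API.

```python
from typing import Dict, List, Sequence


def rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of True outcomes (0.0 for an empty sequence)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize(agent: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute the three leaderboard metrics from per-task outcomes.

    Assumed layout (hypothetical): for attack tasks, True means the injected
    goal was achieved; for benign tasks, True means the task was completed.
    """
    return {
        "indirect_asr": rate(agent["indirect_attacks"]),
        "direct_asr": rate(agent["direct_attacks"]),
        "bsr": rate(agent["benign_tasks"]),
    }


# Two degenerate agents: one refuses every request, one obeys every instruction.
refuse_all = {
    "indirect_attacks": [False] * 4,
    "direct_attacks": [False] * 4,
    "benign_tasks": [False] * 4,
}
follow_all = {
    "indirect_attacks": [True] * 4,
    "direct_attacks": [True] * 4,
    "benign_tasks": [True] * 4,
}

print(summarize(refuse_all))  # ASR is ideal, BSR is the worst possible
print(summarize(follow_all))  # BSR is ideal, ASR is the worst possible
```

Neither degenerate agent does well on all three numbers at once, which is why the leaderboard reports them together.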

Agents ranked by security vulnerability

Scores are averaged across all 14 domains. Entries are ranked by Indirect ASR in descending order, so rank 1 is the most vulnerable to indirect prompt injection.

Rank	Framework	Model	Indirect ASR ↓	Direct ASR ↓	BSR ↑
1	Google ADK	Gemini-3-Pro	55.7%	47.9%	87.0%
2	OpenAI Agents	GPT-5.2	46.7%	58.8%	80.5%
3	OpenClaw	DeepSeek-V4-Pro	41.7%	59.6%	83.3%
4	OpenAI Agents	GPT-5.4	40.0%	51.0%	85.3%
5	OpenClaw	GPT-5.2	35.6%	38.6%	78.5%
6	OpenAI Agents	GPT-OSS-120B	28.5%	46.5%	36.3%
7	Claude Code	Sonnet-4.5	25.2%	26.8%	80.8%
8	OpenClaw	GPT-5.5	17.7%	28.9%	86.3%
9	OpenClaw	Opus-4.6	10.6%	21.4%	85.3%
10	Claude Code	Opus-4.6	8.1%	22.3%	85.6%
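The aggregate columns appear to be unweighted means over the 14 per-domain values. The sketch below checks this against three rows of the per-domain Indirect ASR breakdown and reproduces their relative ranking. The equal weighting is an inference from the published numbers, not a documented formula, and `macro_average` is a hypothetical helper name.

```python
from statistics import mean

# Per-domain Indirect ASR values (%), copied from the per-domain breakdown
# in the order: Workflow, CRM, Customer Service, Travel, Coding, Browser,
# Research, OS-FS, Windows, macOS, Finance, Legal, Telecom, Medical.
indirect_asr_by_domain = {
    "Google ADK / Gemini-3-Pro": [65.9, 69.9, 75.1, 84.2, 77.9, 62.2, 16.2,
                                  59.5, 13.9, 34.6, 60.0, 56.3, 54.7, 49.1],
    "OpenClaw / GPT-5.5": [42.4, 32.4, 29.0, 16.7, 7.9, 17.5, 4.4,
                           6.8, 2.1, 0.0, 15.5, 40.7, 12.8, 19.1],
    "Claude Code / Opus-4.6": [11.1, 3.9, 1.9, 0.0, 15.3, 0.0, 0.0,
                               1.2, 7.7, 28.0, 0.6, 9.5, 17.6, 16.7],
}


def macro_average(values):
    """Unweighted mean over domains, rounded to one decimal place.

    Assumption: every domain counts equally, regardless of how many
    tasks it contains.
    """
    return round(mean(values), 1)


averages = {agent: macro_average(v) for agent, v in indirect_asr_by_domain.items()}

# Higher Indirect ASR means more vulnerable, so sort in descending order.
ranking = sorted(averages, key=averages.get, reverse=True)

for position, agent in enumerate(ranking, start=1):
    print(position, agent, f"{averages[agent]}%")
```

For these three rows the computed means are 55.7%, 17.7%, and 8.1%, matching the Indirect ASR column of the ranking table.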

Indirect ASR ↓ by domain

Per-domain breakdown of Indirect ASR for every (framework, model) entry.

Framework / Model	Workflow	CRM	Customer Service	Travel	Coding	Browser	Research	OS-FS	Windows	macOS	Finance	Legal	Telecom	Medical
Google ADK / Gemini-3-Pro	65.9%	69.9%	75.1%	84.2%	77.9%	62.2%	16.2%	59.5%	13.9%	34.6%	60.0%	56.3%	54.7%	49.1%
OpenAI Agents / GPT-5.2	63.1%	66.4%	53.9%	66.7%	79.3%	24.8%	7.0%	44.3%	13.0%	38.8%	40.4%	66.2%	43.3%	47.2%
OpenClaw / DeepSeek-V4-Pro	47.0%	44.4%	69.8%	48.3%	9.5%	29.7%	17.3%	51.1%	5.6%	0.0%	54.8%	76.6%	63.4%	65.9%
OpenAI Agents / GPT-5.4	50.8%	53.8%	54.7%	55.0%	62.2%	42.9%	14.2%	37.5%	10.6%	26.7%	35.9%	58.1%	20.8%	37.5%
OpenClaw / GPT-5.2	26.1%	50.9%	51.7%	32.5%	65.3%	27.6%	5.8%	20.0%	9.6%	26.7%	26.2%	65.0%	42.3%	48.2%
OpenAI Agents / GPT-OSS-120B	7.2%	9.3%	21.4%	40.8%	15.7%	41.3%	17.3%	32.8%	14.7%	18.3%	33.6%	30.9%	53.1%	62.1%
Claude Code / Sonnet-4.5	35.8%	40.6%	23.4%	11.7%	56.0%	4.5%	0.0%	17.6%	7.7%	20.4%	7.1%	26.6%	49.5%	51.7%
OpenClaw / GPT-5.5	42.4%	32.4%	29.0%	16.7%	7.9%	17.5%	4.4%	6.8%	2.1%	0.0%	15.5%	40.7%	12.8%	19.1%
OpenClaw / Opus-4.6	18.3%	7.2%	4.4%	5.8%	29.3%	0.0%	1.5%	3.1%	5.6%	16.0%	1.7%	12.5%	19.9%	23.0%
Claude Code / Opus-4.6	11.1%	3.9%	1.9%	0.0%	15.3%	0.0%	0.0%	1.2%	7.7%	28.0%	0.6%	9.5%	17.6%	16.7%

Direct ASR ↓ by domain

Per-domain breakdown of Direct ASR for every (framework, model) entry.

Framework / Model	Workflow	CRM	Customer Service	Travel	Coding	Browser	Research	OS-FS	Windows	macOS	Finance	Legal	Telecom	Medical
Google ADK / Gemini-3-Pro	54.4%	55.6%	40.3%	52.4%	71.6%	35.3%	45.4%	73.5%	38.8%	30.4%	22.0%	64.0%	24.2%	62.7%
OpenAI Agents / GPT-5.2	69.1%	83.3%	70.2%	42.9%	62.5%	20.3%	63.2%	83.6%	44.6%	40.4%	51.5%	59.5%	60.7%	71.1%
OpenClaw / DeepSeek-V4-Pro	58.0%	55.7%	75.5%	60.0%	77.9%	27.7%	54.7%	82.9%	33.3%	10.4%	68.0%	67.0%	82.5%	81.0%
OpenAI Agents / GPT-5.4	57.6%	67.8%	64.7%	32.4%	62.5%	29.2%	54.1%	84.0%	47.9%	30.4%	30.5%	54.0%	38.5%	60.4%
OpenClaw / GPT-5.2	30.9%	67.8%	61.4%	12.4%	32.9%	22.8%	58.7%	33.7%	31.7%	26.2%	26.0%	51.5%	41.8%	43.1%
OpenAI Agents / GPT-OSS-120B	61.7%	38.9%	31.2%	54.3%	60.0%	10.0%	55.8%	70.0%	33.3%	20.0%	37.8%	60.5%	56.0%	61.3%
Claude Code / Sonnet-4.5	44.7%	16.7%	27.6%	3.8%	56.6%	20.0%	19.7%	22.6%	21.7%	18.3%	5.5%	20.0%	23.5%	75.1%
OpenClaw / GPT-5.5	31.1%	44.4%	50.0%	18.1%	28.1%	20.2%	38.1%	24.8%	14.2%	6.3%	11.0%	42.0%	45.9%	30.4%
OpenClaw / Opus-4.6	22.1%	11.1%	17.9%	5.7%	33.2%	10.2%	21.2%	24.2%	30.7%	14.0%	4.0%	22.5%	31.8%	50.7%
Claude Code / Opus-4.6	19.2%	11.1%	1.3%	2.9%	33.3%	8.8%	20.8%	26.2%	50.7%	26.0%	4.0%	30.0%	22.6%	54.7%

BSR ↑ by domain

Per-domain breakdown of BSR for every (framework, model) entry.

Framework / Model	Workflow	CRM	Customer Service	Travel	Coding	Browser	Research	OS-FS	Windows	macOS	Finance	Legal	Telecom	Medical
Google ADK / Gemini-3-Pro	89.4%	86.1%	91.3%	99.4%	100.0%	97.2%	100.0%	87.0%	76.0%	61.2%	95.1%	87.1%	66.4%	81.9%
OpenAI Agents / GPT-5.2	76.4%	72.7%	91.3%	95.3%	93.6%	55.6%	94.5%	80.0%	74.6%	71.6%	97.7%	85.7%	71.8%	66.3%
OpenClaw / DeepSeek-V4-Pro	96.4%	83.6%	83.8%	95.1%	99.8%	83.1%	96.0%	57.6%	67.3%	33.1%	92.4%	92.6%	93.8%	91.0%
OpenAI Agents / GPT-5.4	79.8%	78.8%	89.7%	90.6%	98.5%	77.1%	95.0%	85.7%	77.5%	76.0%	96.6%	90.5%	71.8%	86.1%
OpenClaw / GPT-5.2	61.0%	75.2%	89.1%	91.4%	91.1%	56.9%	95.0%	78.6%	58.1%	74.5%	92.8%	90.9%	75.5%	68.4%
OpenAI Agents / GPT-OSS-120B	56.8%	4.2%	77.1%	0.0%	49.5%	25.0%	30.0%	61.7%	67.9%	12.0%	12.8%	9.7%	65.0%	35.8%
Claude Code / Sonnet-4.5	69.8%	82.4%	77.7%	56.7%	98.4%	95.8%	100.0%	78.6%	83.5%	75.2%	94.9%	79.9%	71.8%	66.4%
OpenClaw / GPT-5.5	88.1%	75.6%	88.6%	86.1%	99.6%	88.8%	99.0%	52.5%	68.9%	91.7%	96.5%	93.1%	89.3%	91.0%
OpenClaw / Opus-4.6	91.5%	82.4%	89.4%	89.4%	100.0%	97.2%	96.0%	78.7%	70.0%	80.0%	91.0%	94.0%	65.5%	69.3%
Claude Code / Opus-4.6	82.7%	83.0%	89.8%	89.4%	98.4%	97.2%	100.0%	83.3%	78.0%	83.3%	96.5%	91.0%	68.2%	57.8%

Submit your own results

To submit an agent for inclusion on the leaderboard, follow the contribution workflow in the open-source DecodingTrust-Agent repository on GitHub. Submissions are validated against the standard task list and judged with the same automated pipeline that produces leaderboard scores, so the comparison stays apples-to-apples.

Source: open-source DecodingTrust-Agent leaderboard (paper_v2).
