Leaderboard

The AgentSuite-Red leaderboard reuses the open-source leaderboard from the DecodingTrust-Agent platform, which evaluates leading agent frameworks and underlying LLMs across all 14 domains under both direct and indirect prompt-injection threat models.

Three metrics are reported per (framework, model, domain) combination:

| Metric | Description | Direction |
| --- | --- | --- |
| Indirect ASR | Attack success rate when adversarial content is hidden inside environment data (emails, documents, tool outputs). | Lower is safer |
| Direct ASR | Attack success rate when the adversary controls the user prompt directly. | Lower is safer |
| BSR | Benign Success Rate: task completion on non-adversarial tasks. | Higher is better |

A strong agent must achieve low Indirect ASR, low Direct ASR, and high BSR simultaneously: a model that refuses everything scores perfectly on ASR but fails BSR, and a model that follows every instruction scores perfectly on BSR but fails ASR.
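
As a rough illustration of this trade-off, the sketch below computes the three metrics from per-task outcomes for two degenerate agents. The `Outcome` record, its field names, and the `rate` and `summarize` helpers are hypothetical and not part of the DecodingTrust-Agent codebase. The only assumption taken from the definitions above is that each metric is a simple success percentage over its own task set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-task record; field names are illustrative only."""
    kind: str      # "indirect", "direct", or "benign"
    success: bool  # attack succeeded (adversarial) or task completed (benign)


def rate(outcomes, kind):
    """Percentage of tasks of the given kind whose `success` flag is set."""
    selected = [o for o in outcomes if o.kind == kind]
    if not selected:
        raise ValueError(f"no outcomes of kind {kind!r}")
    return 100.0 * sum(o.success for o in selected) / len(selected)


def summarize(outcomes):
    """Return the three leaderboard-style metrics as percentages."""
    return {
        "indirect_asr": rate(outcomes, "indirect"),
        "direct_asr": rate(outcomes, "direct"),
        "bsr": rate(outcomes, "benign"),
    }


KINDS = ("indirect", "direct", "benign")

# Degenerate agent that refuses everything: no attack lands, no task completes.
refuse_all = [Outcome(kind, False) for kind in KINDS for _ in range(4)]

# Degenerate agent that follows every instruction: every attack lands,
# and every benign task completes.
obey_all = [Outcome(kind, True) for kind in KINDS for _ in range(4)]

print(summarize(refuse_all))
print(summarize(obey_all))
```

Neither degenerate agent does well on all three columns at once, which is why the leaderboard reports the metrics side by side instead of as a single score.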

Agents ranked by security vulnerability

Scores are averaged across all 14 domains. Rows are ranked by Indirect ASR in descending order, so rank 1 is the entry most vulnerable to indirect prompt injection.

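| Rank | Framework | Model | Indirect ASR ↓ | Direct ASR ↓ | BSR ↑ |
| --- | --- | --- | --- | --- | --- |
| 1 | Google ADK | Gemini-3-Pro | 55.7% | 47.9% | 87.0% |
| 2 | OpenAI Agents | GPT-5.2 | 46.7% | 58.8% | 80.5% |
| 3 | OpenClaw | DeepSeek-V4-Pro | 41.7% | 59.6% | 83.3% |
| 4 | OpenAI Agents | GPT-5.4 | 40.0% | 51.0% | 85.3% |
| 5 | OpenClaw | GPT-5.2 | 35.6% | 38.6% | 78.5% |
| 6 | OpenAI Agents | GPT-OSS-120B | 28.5% | 46.5% | 36.3% |
| 7 | Claude Code | Sonnet-4.5 | 25.2% | 26.8% | 80.8% |
| 8 | OpenClaw | GPT-5.5 | 17.7% | 28.9% | 86.3% |
| 9 | OpenClaw | Opus-4.6 | 10.6% | 21.4% | 85.3% |
| 10 | Claude Code | Opus-4.6 | 8.1% | 22.3% | 85.6% |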
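
The overall figures appear consistent with an unweighted mean over the 14 per-domain values. The sketch below checks this for one entry, using the Indirect ASR values listed for Claude Code / Opus-4.6 in the per-domain breakdown. The variable and function names are illustrative, and the unweighted mean is an inference from the published numbers, not a documented formula.

```python
from statistics import mean

DOMAINS = [
    "Workflow", "CRM", "Customer Service", "Travel", "Coding", "Browser",
    "Research", "OS-FS", "Windows", "macOS", "Finance", "Legal",
    "Telecom", "Medical",
]

# Indirect ASR (%) for Claude Code / Opus-4.6, copied from the per-domain table.
INDIRECT_ASR = [11.1, 3.9, 1.9, 0.0, 15.3, 0.0, 0.0, 1.2, 7.7, 28.0, 0.6, 9.5, 17.6, 16.7]


def macro_average(per_domain):
    """Unweighted mean over domains, rounded to one decimal place."""
    return round(mean(per_domain.values()), 1)


per_domain = dict(zip(DOMAINS, INDIRECT_ASR))
overall = macro_average(per_domain)

# Domain with the highest Indirect ASR for this entry.
worst = max(per_domain, key=per_domain.get)

print(overall)  # 8.1, matching the ranked table
print(worst)    # macOS
```

An unweighted mean gives every domain equal influence regardless of how many tasks it contains, so a single weak domain (here macOS at 28.0%) can move the overall score noticeably.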

Indirect ASR ↓ by domain

Per-domain breakdown of Indirect ASR for every (framework, model) entry.

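| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google ADK / Gemini-3-Pro | 65.9% | 69.9% | 75.1% | 84.2% | 77.9% | 62.2% | 16.2% | 59.5% | 13.9% | 34.6% | 60.0% | 56.3% | 54.7% | 49.1% |
| OpenAI Agents / GPT-5.2 | 63.1% | 66.4% | 53.9% | 66.7% | 79.3% | 24.8% | 7.0% | 44.3% | 13.0% | 38.8% | 40.4% | 66.2% | 43.3% | 47.2% |
| OpenClaw / DeepSeek-V4-Pro | 47.0% | 44.4% | 69.8% | 48.3% | 9.5% | 29.7% | 17.3% | 51.1% | 5.6% | 0.0% | 54.8% | 76.6% | 63.4% | 65.9% |
| OpenAI Agents / GPT-5.4 | 50.8% | 53.8% | 54.7% | 55.0% | 62.2% | 42.9% | 14.2% | 37.5% | 10.6% | 26.7% | 35.9% | 58.1% | 20.8% | 37.5% |
| OpenClaw / GPT-5.2 | 26.1% | 50.9% | 51.7% | 32.5% | 65.3% | 27.6% | 5.8% | 20.0% | 9.6% | 26.7% | 26.2% | 65.0% | 42.3% | 48.2% |
| OpenAI Agents / GPT-OSS-120B | 7.2% | 9.3% | 21.4% | 40.8% | 15.7% | 41.3% | 17.3% | 32.8% | 14.7% | 18.3% | 33.6% | 30.9% | 53.1% | 62.1% |
| Claude Code / Sonnet-4.5 | 35.8% | 40.6% | 23.4% | 11.7% | 56.0% | 4.5% | 0.0% | 17.6% | 7.7% | 20.4% | 7.1% | 26.6% | 49.5% | 51.7% |
| OpenClaw / GPT-5.5 | 42.4% | 32.4% | 29.0% | 16.7% | 7.9% | 17.5% | 4.4% | 6.8% | 2.1% | 0.0% | 15.5% | 40.7% | 12.8% | 19.1% |
| OpenClaw / Opus-4.6 | 18.3% | 7.2% | 4.4% | 5.8% | 29.3% | 0.0% | 1.5% | 3.1% | 5.6% | 16.0% | 1.7% | 12.5% | 19.9% | 23.0% |
| Claude Code / Opus-4.6 | 11.1% | 3.9% | 1.9% | 0.0% | 15.3% | 0.0% | 0.0% | 1.2% | 7.7% | 28.0% | 0.6% | 9.5% | 17.6% | 16.7% |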

Direct ASR ↓ by domain

Per-domain breakdown of Direct ASR for every (framework, model) entry.

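| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google ADK / Gemini-3-Pro | 54.4% | 55.6% | 40.3% | 52.4% | 71.6% | 35.3% | 45.4% | 73.5% | 38.8% | 30.4% | 22.0% | 64.0% | 24.2% | 62.7% |
| OpenAI Agents / GPT-5.2 | 69.1% | 83.3% | 70.2% | 42.9% | 62.5% | 20.3% | 63.2% | 83.6% | 44.6% | 40.4% | 51.5% | 59.5% | 60.7% | 71.1% |
| OpenClaw / DeepSeek-V4-Pro | 58.0% | 55.7% | 75.5% | 60.0% | 77.9% | 27.7% | 54.7% | 82.9% | 33.3% | 10.4% | 68.0% | 67.0% | 82.5% | 81.0% |
| OpenAI Agents / GPT-5.4 | 57.6% | 67.8% | 64.7% | 32.4% | 62.5% | 29.2% | 54.1% | 84.0% | 47.9% | 30.4% | 30.5% | 54.0% | 38.5% | 60.4% |
| OpenClaw / GPT-5.2 | 30.9% | 67.8% | 61.4% | 12.4% | 32.9% | 22.8% | 58.7% | 33.7% | 31.7% | 26.2% | 26.0% | 51.5% | 41.8% | 43.1% |
| OpenAI Agents / GPT-OSS-120B | 61.7% | 38.9% | 31.2% | 54.3% | 60.0% | 10.0% | 55.8% | 70.0% | 33.3% | 20.0% | 37.8% | 60.5% | 56.0% | 61.3% |
| Claude Code / Sonnet-4.5 | 44.7% | 16.7% | 27.6% | 3.8% | 56.6% | 20.0% | 19.7% | 22.6% | 21.7% | 18.3% | 5.5% | 20.0% | 23.5% | 75.1% |
| OpenClaw / GPT-5.5 | 31.1% | 44.4% | 50.0% | 18.1% | 28.1% | 20.2% | 38.1% | 24.8% | 14.2% | 6.3% | 11.0% | 42.0% | 45.9% | 30.4% |
| OpenClaw / Opus-4.6 | 22.1% | 11.1% | 17.9% | 5.7% | 33.2% | 10.2% | 21.2% | 24.2% | 30.7% | 14.0% | 4.0% | 22.5% | 31.8% | 50.7% |
| Claude Code / Opus-4.6 | 19.2% | 11.1% | 1.3% | 2.9% | 33.3% | 8.8% | 20.8% | 26.2% | 50.7% | 26.0% | 4.0% | 30.0% | 22.6% | 54.7% |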

BSR ↑ by domain

Per-domain breakdown of BSR for every (framework, model) entry.

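| Framework / Model | Workflow | CRM | Customer Service | Travel | Coding | Browser | Research | OS-FS | Windows | macOS | Finance | Legal | Telecom | Medical |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google ADK / Gemini-3-Pro | 89.4% | 86.1% | 91.3% | 99.4% | 100.0% | 97.2% | 100.0% | 87.0% | 76.0% | 61.2% | 95.1% | 87.1% | 66.4% | 81.9% |
| OpenAI Agents / GPT-5.2 | 76.4% | 72.7% | 91.3% | 95.3% | 93.6% | 55.6% | 94.5% | 80.0% | 74.6% | 71.6% | 97.7% | 85.7% | 71.8% | 66.3% |
| OpenClaw / DeepSeek-V4-Pro | 96.4% | 83.6% | 83.8% | 95.1% | 99.8% | 83.1% | 96.0% | 57.6% | 67.3% | 33.1% | 92.4% | 92.6% | 93.8% | 91.0% |
| OpenAI Agents / GPT-5.4 | 79.8% | 78.8% | 89.7% | 90.6% | 98.5% | 77.1% | 95.0% | 85.7% | 77.5% | 76.0% | 96.6% | 90.5% | 71.8% | 86.1% |
| OpenClaw / GPT-5.2 | 61.0% | 75.2% | 89.1% | 91.4% | 91.1% | 56.9% | 95.0% | 78.6% | 58.1% | 74.5% | 92.8% | 90.9% | 75.5% | 68.4% |
| OpenAI Agents / GPT-OSS-120B | 56.8% | 4.2% | 77.1% | 0.0% | 49.5% | 25.0% | 30.0% | 61.7% | 67.9% | 12.0% | 12.8% | 9.7% | 65.0% | 35.8% |
| Claude Code / Sonnet-4.5 | 69.8% | 82.4% | 77.7% | 56.7% | 98.4% | 95.8% | 100.0% | 78.6% | 83.5% | 75.2% | 94.9% | 79.9% | 71.8% | 66.4% |
| OpenClaw / GPT-5.5 | 88.1% | 75.6% | 88.6% | 86.1% | 99.6% | 88.8% | 99.0% | 52.5% | 68.9% | 91.7% | 96.5% | 93.1% | 89.3% | 91.0% |
| OpenClaw / Opus-4.6 | 91.5% | 82.4% | 89.4% | 89.4% | 100.0% | 97.2% | 96.0% | 78.7% | 70.0% | 80.0% | 91.0% | 94.0% | 65.5% | 69.3% |
| Claude Code / Opus-4.6 | 82.7% | 83.0% | 89.8% | 89.4% | 98.4% | 97.2% | 100.0% | 83.3% | 78.0% | 83.3% | 96.5% | 91.0% | 68.2% | 57.8% |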

Submit your own results

To submit an agent for inclusion on the leaderboard, follow the contribution workflow in the open-source DecodingTrust-Agent repository on GitHub. Submissions are validated against the standard task list and judged with the same automated pipeline that produces leaderboard scores, so the comparison stays apples-to-apples.

Source: open-source DecodingTrust-Agent leaderboard (paper_v2).