On-Prem Deployment

Deploy Virtue AgentSuite-Red on-premise with Docker Compose. AgentSuite-Red is an automated red-teaming platform for AI agents that runs adversarial evaluations against any HTTP-accessible agent, exercising it through Virtue Agent ForgingGround — a pool of containerized environment sandboxes (Salesforce, Gmail, Slack, Atlassian, GitLab, BigQuery, etc.).

Prerequisites

| Requirement | Details |
| --- | --- |
| Operating System | Ubuntu 22.04 LTS (or RHEL 9), x86_64 |
| Docker | Docker Engine 24+ with Compose v2 plugin |
| Python | 3.11+ (only on the host that runs gen_compose) |
| uv | Latest (docs.astral.sh/uv) |
| Git | 2.40+ with submodule support |
| CPU | 12 cores minimum (16+ recommended) |
| RAM | 24 GB minimum (32 GB recommended) |
| Disk | 500 GB SSD (~30 GB for env images, the rest for trajectories and DB) |
| GPU | Not required |
| Outbound network | Required for first-time image pulls and PyPI dependencies |

Components

All images are hosted in us-docker.pkg.dev/customer-docker-virtueai/agentsuite-red/. There are two categories.

Core services

| Image | Description |
| --- | --- |
| agentsuite-red/backend | FastAPI orchestrator: schedules evaluations, runs the in-process MCP proxy, calls the target agent, persists results. Listens on port 38085. |
| agentsuite-red/env-server | Docker-in-Docker env-pool manager: starts/resets sandbox environments and spawns MCP server subprocesses. Listens on port 8091. |
| agentsuite-red/frontend | React dashboard served by nginx. Listens on port 22100; reverse-proxies /api/* and /forgingground/mcp to the backend. |
| postgres:17 | Single backend database (agentsuite_red). |

Sandbox environment pool

The env-server starts a pre-warmed pool of sandboxed applications. Each environment can run multiple instances in parallel. The default pool (defined in env_server/pool.yaml):

| Domain | Environments | Default count |
| --- | --- | --- |
| CRM | salesforce, gmail | 4, 8 |
| Communication | slack, calendar, zoom, telegram, whatsapp | 4, 4, 4, 4, 8 |
| Code / DevOps | atlassian, terminal, googledocs | 4, 4, 8 |
| Finance | paypal, finance | 4, 4 |
| Customer Service | customer_service | 4 |
| OS | OS-filesystem | 4 |
| Travel | travel-suite | 4 |
| Workflow | bigquery, snowflake, databricks, google-form | 4, 4, 4, 4 |

count is the number of pre-started instances and equals the maximum parallelism for that environment.

Step 1 — Get the code bundle

Extract the bundle delivered by Virtue AI, which contains the deployment files for all images:

unzip agentsuite-red.zip
cd agentsuite-red

Step 2 — Authenticate with the image registry

If you are pulling pre-built images, authenticate to the Virtue AI registry with the GCP service-account key included in the bundle:

docker login -u _json_key --password-stdin https://us-docker.pkg.dev < serviceaccount.json

If you are building images from source (the default docker-compose.yml does this), skip this step.

Step 3 — Configure deployment values

Copy the environment file template and edit it:

cp .env.example .env
$EDITOR .env

Required settings:

| Variable | What to set |
| --- | --- |
| AGENTSUITE_DATABASE_URL | Leave at postgresql+asyncpg://postgres:postgres@localhost:5432/agentsuite_red for single-node. Point at an external Postgres for production. |
| AGENTSUITE_PROXY_MCP_URL | The public URL the target agent will use, e.g. https://red.acme.internal/forgingground/mcp. Shown in the UI as the value users paste into their agent config. |
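
Before starting the stack, it can help to confirm that both required variables are actually set. The following is a minimal sketch, not part of the bundle: `parse_env` and `check_required` are hypothetical helper names, and the parser assumes simple `KEY=VALUE` lines without shell expansion or multi-line values.

```python
from urllib.parse import urlparse

# The two variables listed as required in the table above.
REQUIRED = ("AGENTSUITE_DATABASE_URL", "AGENTSUITE_PROXY_MCP_URL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_required(values: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in REQUIRED:
        if not values.get(name):
            problems.append(f"{name} is missing or empty")
    mcp_url = values.get("AGENTSUITE_PROXY_MCP_URL", "")
    if mcp_url and urlparse(mcp_url).scheme not in ("http", "https"):
        problems.append("AGENTSUITE_PROXY_MCP_URL should be an http(s) URL")
    return problems
```

A typical call would be `check_required(parse_env(Path(".env").read_text()))` (with `pathlib.Path` imported), run from the bundle directory after editing .env.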

Optional — change if applicable:

| Variable | When to change | Default |
| --- | --- | --- |
| AGENTSUITE_MAX_CONCURRENT_TASKS | Raise once the host has observed headroom. | 5 |
| AGENTSUITE_AGENT_REQUEST_TIMEOUT | Increase if the target agent is slow. | 600.0 (seconds) |
| AGENTSUITE_AUTH_ENABLED | Set true for production. | false |
| AGENTSUITE_VIRTUE_AUTH_URL | OIDC issuer URL when auth is enabled. | empty |
| AGENTSUITE_JWT_SECRET | HS256 secret if using local JWT instead of OIDC. | empty |

If you enable auth, generate a strong JWT secret:

SECRET=$(openssl rand -hex 32) && echo "JWT secret: $SECRET"

Paste it into AGENTSUITE_JWT_SECRET in .env.
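
If openssl is not available on the host, an equivalent secret can be produced with the Python standard library. This is a sketch of the same idea (32 random bytes, hex-encoded); `generate_jwt_secret` is a hypothetical helper name, not something shipped in the bundle.

```python
import secrets


def generate_jwt_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness as hex."""
    return secrets.token_hex(num_bytes)


# Print a line that can be pasted into .env.
print(f"AGENTSUITE_JWT_SECRET={generate_jwt_secret()}")
```

With the default of 32 bytes, the result is a 64-character hex string, the same length as the output of `openssl rand -hex 32`.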

Step 4 — Tune the sandbox pool

env_server/pool.yaml controls which environments are pre-started and at what fan-out. The default file is suitable for a PoC. To change parallelism for a specific environment:

# env_server/pool.yaml
pools:
  salesforce:
    count: 8   # was 4; raise to support more parallel CRM tasks
  gmail:
    count: 8
  # ...

Each additional instance costs roughly 50–500 MB of RAM depending on the environment (Salesforce is among the heavier ones at ~300 MB; Gmail's Mailpit is ~100 MB).
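
To judge whether a pool change fits on the host, the per-instance figures above can be turned into a rough estimate. The sketch below is illustrative only: `extra_ram_mb` is a hypothetical helper, the Salesforce and Gmail numbers are the ones quoted above, and `DEFAULT_MB` is an assumed pessimistic value (the top of the quoted range) for environments with no quoted figure.

```python
# Per-instance RAM in MB. Only these two figures come from the text above.
PER_INSTANCE_MB = {"salesforce": 300, "gmail": 100}

# Assumption: use the top of the quoted 50-500 MB range for everything else.
DEFAULT_MB = 500


def extra_ram_mb(old_counts: dict[str, int], new_counts: dict[str, int]) -> int:
    """Estimate additional RAM (MB) needed when pool counts are raised."""
    total = 0
    for env, new in new_counts.items():
        added = max(0, new - old_counts.get(env, 0))  # ignore reductions
        total += added * PER_INSTANCE_MB.get(env, DEFAULT_MB)
    return total


# Raising salesforce from 4 to 8 while gmail stays at 8.
print(extra_ram_mb({"salesforce": 4, "gmail": 8}, {"salesforce": 8, "gmail": 8}))
```

For the change shown (four extra Salesforce instances at ~300 MB each), the estimate is 1200 MB.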

After editing pool.yaml, the pool compose file is regenerated automatically by start.sh in Step 5. To regenerate manually:

uv run python -m env_server.gen_compose

Step 5 — Deploy the stack

./start.sh

start.sh performs two actions:

  1. Runs uv run python -m env_server.gen_compose to regenerate env_server/pool-compose.yml from pool.yaml.
  2. Runs docker compose up --build -d, which starts PostgreSQL, env-server, backend, frontend, and ~100–140 sandbox environment containers from pool-compose.yml.

Note: the first start is slow. Docker pulls ~26 distinct sandbox images and builds three local images, so expect 15–30 minutes on first start. Subsequent restarts complete in 1–2 minutes.

Step 6 — Verify deployment

Check that all four core services are up:

docker compose ps --format "table {{.Service}}\t{{.Status}}" \
| grep -E "^(postgres|env-server|backend|frontend)\b"

Expected:

postgres     Up X minutes (healthy)
env-server   Up X minutes
backend      Up X minutes
frontend     Up X minutes

Hit each service's HTTP endpoint:

curl -fsS http://localhost:38085/health                             # backend
curl -fsS http://localhost:8091/health                              # env-server
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:22100/  # frontend (200)
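
Because the first start can take a while, the same three checks can be wrapped in a retry loop. The following sketch uses only the standard library; `http_status` and `wait_until_healthy` are hypothetical helper names, and treating HTTP 200 as "healthy" for all three endpoints is an assumption based on the curl commands above.

```python
import time
import urllib.error
import urllib.request

# The three URLs checked with curl above.
ENDPOINTS = {
    "backend": "http://localhost:38085/health",
    "env-server": "http://localhost:8091/health",
    "frontend": "http://localhost:22100/",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(endpoints, fetch=http_status, attempts=30,
                       delay=10.0, sleep=time.sleep):
    """Poll until every endpoint returns 200; return names still failing."""
    pending = dict(endpoints)
    for attempt in range(attempts):
        pending = {n: u for n, u in pending.items() if fetch(u) != 200}
        if not pending:
            return []
        if attempt < attempts - 1:
            sleep(delay)
    return sorted(pending)
```

Calling `wait_until_healthy(ENDPOINTS)` returns an empty list once all three services respond, or the names of the services that never became healthy within the attempt budget.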

Local endpoints once everything is up:

| Service | URL |
| --- | --- |
| Dashboard | http://localhost:22100 |
| Backend API | http://localhost:38085 |
| env-server API | http://localhost:8091 |
| MCP proxy (for the target agent) | http://localhost:22100/forgingground/mcp |

Step 7 — Seed the red-teaming task bank

Populate the database with the bundled red-teaming tasks:

docker compose exec backend uv run python scripts/populate_dt_source_tasks.py

To also load demo data for a quick UI walkthrough:

docker compose exec backend uv run python scripts/populate_demo_data.py

After seeding, refresh the dashboard at http://localhost:22100 — the task bank should appear in the New Scan wizard.

Connect your agent

The agent under test connects to AgentSuite-Red's MCP proxy at the URL configured in AGENTSUITE_PROXY_MCP_URL. The dashboard's New Scan wizard generates a ready-to-paste config snippet for several frameworks:

Claude Code

{
  "mcpServers": {
    "virtue-forgingground": {
      "type": "sse",
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}

Cursor

{
  "mcpServers": {
    "virtue-forgingground": {
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}

OpenAI Agents SDK (Python)

from agents import Agent
from agents.mcp import MCPServerSse

mcp_server = MCPServerSse(
    params={
        "url": "https://red.acme.internal/forgingground/mcp",
        "headers": {"X-API-Key": "<your-api-key>"},
    },
)
agent = Agent(name="my-agent", mcp_servers=[mcp_server])

Google ADK

from google.adk.tools.mcp_tool import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPConnectionParams

toolset = McpToolset(
    connection_params=StreamableHTTPConnectionParams(
        url="https://red.acme.internal/forgingground/mcp",
        headers={"X-API-Key": "<your-api-key>"},
    ),
)

No code changes are needed beyond the MCP client config. The proxy transparently routes tool calls to the right sandbox environment for the running task. Tool-name prefixing (salesforce_search_contacts, gmail_send_email, …) lets a single connection multiplex across all enabled environments.
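
To make the prefixing idea concrete, here is a small sketch of how a prefixed tool name could be split back into an environment and a tool. It is not the proxy's actual implementation: `split_tool_name` is a hypothetical function, the underscore separator is inferred from the two example names above, and `customer_service_lookup_order` is a made-up tool name used only to show why a longest-prefix match is needed when an environment name itself contains an underscore.

```python
# A few environment names from the default pool.
ENVIRONMENTS = ["salesforce", "gmail", "slack", "customer_service"]


def split_tool_name(tool_name, environments):
    """Return (environment, tool) using the longest matching prefix."""
    # Try longer names first so "customer_service" wins over any shorter
    # environment name that happens to be a prefix of it.
    for env in sorted(environments, key=len, reverse=True):
        prefix = env + "_"
        if tool_name.startswith(prefix) and len(tool_name) > len(prefix):
            return env, tool_name[len(prefix):]
    raise ValueError(f"no known environment prefix in {tool_name!r}")


print(split_tool_name("salesforce_search_contacts", ENVIRONMENTS))
print(split_tool_name("gmail_send_email", ENVIRONMENTS))
print(split_tool_name("customer_service_lookup_order", ENVIRONMENTS))  # hypothetical tool
```

The first call yields `("salesforce", "search_contacts")`; an unknown prefix raises `ValueError` rather than guessing.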

For the full agent-side contract (HTTP endpoint shape, session handling), see Connect Your Agent.

Running evaluations

Via the dashboard

  1. Open http://localhost:22100. If auth is enabled, sign in with the bootstrap admin account; if auth is disabled, no sign-in is needed.
  2. Click New Scan.
  3. Choose target domains (e.g. CRM, Code, Workflow), risk categories (e.g. data exfiltration, malicious code, privilege escalation), and threat models (direct, indirect).
  4. Configure your agent endpoint and paste the MCP URL into your agent's config (see above).
  5. Click Start Scan. The dashboard shows task progress in real time.

For the full UI walkthrough (login → Add Agent → New Scan wizard → results → report), see Run Red-Teaming Scan — the on-prem dashboard behaves identically.

Via the CLI

# Create an evaluation
uv run agentsuite-red evaluation create my-eval

# List available evaluations
uv run agentsuite-red evaluation list

# Run an evaluation
uv run agentsuite-red evaluation run my-eval

# Watch run status
uv run agentsuite-red status <run-id>

# View aggregated stats
uv run agentsuite-red stats <run-id>

Observability

All agent execution traces are recorded in the Trajectories tab on the dashboard. Each task is a separate session with its own session ID. The Trajectories view records:

  • User queries (user role) — the instruction sent to the agent for this task.
  • Agent tool calls (agent role) — tool name and full input parameters.
  • Tool outputs (tool role) — full execution results from the sandbox MCP server.
  • Agent responses (agent role) — the final response returned to the user.

Click View on any step to see detailed metadata, the judge's verdict, and which policy (if any) was violated.

Each task also generates a JSON trajectory under agentsuite_server/data/trajectories/<run_id>/<session_id>.json containing the same payload the dashboard renders.
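
For offline analysis, the trajectory files of one run can be loaded straight from disk. The sketch below relies only on the path layout given above (`<run_id>/<session_id>.json`); `load_trajectories` is a hypothetical helper, and no assumption is made about the JSON schema inside each file.

```python
import json
from pathlib import Path


def load_trajectories(root, run_id):
    """Map session_id -> parsed JSON for every trajectory file of one run."""
    run_dir = Path(root) / run_id
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(run_dir.glob("*.json"))
    }
```

For example, `load_trajectories("agentsuite_server/data/trajectories", "<run_id>")` returns a dictionary keyed by session ID, which can then be inspected or exported as needed.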

API reference

When auth is enabled, all /api/* endpoints require Authorization: Bearer <jwt> (or X-API-Key: <key>); authentication is delegated to virtue-auth. Failed requests carry a WWW-Authenticate: Bearer error="..." header whose error code (token_expired, invalid_signature, invalid_token, missing_auth, or server_misconfigured) lets the client decide how to react.
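
A client can read the error code out of the WWW-Authenticate header and branch on it. The following sketch shows one possible policy; `auth_error_code` and `next_action` are hypothetical helpers, and the mapping from code to action (refresh on token_expired, log in again on the token and signature errors, give up otherwise) is an assumption rather than documented behaviour.

```python
import re

# Matches the error="..." parameter of a Bearer challenge.
_ERROR_RE = re.compile(r'error="([^"]*)"')


def auth_error_code(www_authenticate):
    """Extract the error code from a `Bearer error="..."` header value."""
    if not www_authenticate:
        return None
    match = _ERROR_RE.search(www_authenticate)
    return match.group(1) if match else None


def next_action(code):
    """Assumed client policy: 'refresh', 'login', or 'fail'."""
    if code == "token_expired":
        return "refresh"  # call the refresh endpoint, then retry
    if code in ("invalid_signature", "invalid_token", "missing_auth"):
        return "login"    # credentials are unusable; authenticate again
    return "fail"         # e.g. server_misconfigured: retrying will not help


print(next_action(auth_error_code('Bearer error="token_expired"')))
```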

The most important endpoints, grouped by responsibility:

Authentication

| API | Method | Purpose |
| --- | --- | --- |
| /auth-api/api/v1/auth/login | POST | Exchange username + password for an access token bound to a tenant. |
| /auth-api/api/v1/auth/refresh | POST | Rotate the access token (refresh tokens last 7 days). |
| /api/health | GET | Liveness check. |

Agent registration

| API | Method | Purpose |
| --- | --- | --- |
| /api/agents | POST | Register an agent endpoint with a friendly name. |
| /api/agents | GET | List all agents in the caller's tenant. |
| /api/agents/{agent_id} | DELETE | Remove an agent registration. |
| /api/agents/test | POST | Probe an agent endpoint for reachability before saving. |

Evaluation lifecycle

| API | Method | Purpose |
| --- | --- | --- |
| /api/metadata/summary | GET | Discover available domains / threat models / risk categories / task types. |
| /api/evaluations | POST | Create an evaluation. Materializes one EvaluationTask per matching dataset task. |
| /api/evaluations | GET | List evaluations for the tenant. |
| /api/evaluations/with-stats | GET | Same list plus run-count and ASR rollups. |
| /api/evaluations/{evaluation_id} | GET | Fetch a single evaluation. |
| /api/evaluations/{evaluation_id}/tasks | GET | List the configured EvaluationTask rows. |
| /api/evaluations/{evaluation_id}/sessions | GET | List runs for this evaluation. |
| /api/evaluations/{evaluation_id} | DELETE | Soft-delete an evaluation. |

Run a red-teaming scan

Runs are async: create, start, then poll status.

| API | Method | Purpose |
| --- | --- | --- |
| /api/runs | POST | Materialize a new Run for an evaluation. |
| /api/runs/{run_id}/start | POST | Kick off task execution via the ForgingGround MCP gateway. |
| /api/runs/{run_id}/cancel | POST | Cancel an in-flight run. |
| /api/runs/{run_id} | GET | Run metadata (config, timestamps, error message). |
| /api/runs/{run_id}/status | GET | Cheap polling endpoint (counts only). |
| /api/sessions/{run_id}/stats | GET | Full stats including per-category / per-domain / per-threat-model breakdowns. |
| /api/runs/bulk-delete | POST | Soft-delete a batch of runs. |
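
The create, start, then poll pattern can be sketched as a small polling helper. This is illustrative only: `poll_run` is a hypothetical function, `get_status` stands in for whatever performs GET /api/runs/{run_id}/status and returns the decoded JSON, and the response is assumed to carry a `status` field. Of the terminal values, only "completed" is taken from this page; "failed" and "cancelled" are guesses.

```python
import time

# Only "completed" appears in this document; the other two are assumptions.
TERMINAL = {"completed", "failed", "cancelled"}


def poll_run(get_status, run_id, interval=5.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call get_status(run_id) until a terminal status or the timeout."""
    deadline = clock() + timeout
    while True:
        payload = get_status(run_id)
        if payload.get("status") in TERMINAL:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} not finished after {timeout}s")
        sleep(interval)
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP library and makes it easy to exercise without a running stack.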

Results & trajectories

| API | Method | Purpose |
| --- | --- | --- |
| /api/results | GET | List task results, filterable by run / eval / domain / threat model / risk category / task type / status / attack success. |
| /api/results/{result_id} | GET | Single task result with full trajectory, judge metadata, agent responses. |
| /api/results/{result_id}/trajectory | GET | Raw trajectory JSON file. |
| /api/results/{result_id} | PATCH | Star/unstar a result for inclusion in the curated report. |

Report generation

PDF reports are generated asynchronously and downloaded once ready.

| API | Method | Purpose |
| --- | --- | --- |
| /api/runs/{run_id}/report | POST | Enqueue a PDF report job. |
| /api/runs/report-jobs | GET | List in-flight report jobs for the tenant. |
| /api/runs/{run_id}/report/{job_id} | GET | Download the rendered PDF (?inline=true for browser preview). |
| /api/runs/reports/{job_id} | GET / DELETE | Fetch / delete a generated report record. |
| /api/evaluations/{evaluation_id}/reports | GET | All reports across all runs of an evaluation. |

Metadata

| API | Method | Purpose |
| --- | --- | --- |
| /api/metadata/risk-categories | GET | Canonical RT taxonomy (RT-1 … RT-9). |
| /api/metadata/threat-models | GET | direct, indirect, etc. |
| /api/metadata/domains | GET | crm, workflow, apple-red, code, customer_service, … |
| /api/metadata/envs | GET | Environments registered with env-server. |
| /api/metadata/mcp-config | GET | MCP server URLs to paste into an agent. |
| /api/metadata/task-facets | GET | Per-domain facet counts for filter UI. |

Typical end-to-end flow

# 1. Authenticate
POST /auth-api/api/v1/auth/login → access_token

# 2. Register the agent (once)
POST /api/agents { name, endpoint } → agent_id

# 3. Build an evaluation
GET /api/metadata/summary
POST /api/evaluations { name, agent_endpoint, domains, risk_categories, ... } → evaluation_id

# 4. Run it
POST /api/runs { evaluation_id } → run_id
POST /api/runs/{run_id}/start
GET /api/runs/{run_id}/status (poll until status == "completed")

# 5. Inspect results
GET /api/sessions/{run_id}/stats (aggregated breakdowns)
GET /api/results?run_id={run_id} (per-task)
GET /api/results/{result_id} (single trajectory)

# 6. (Optional) curate + report
PATCH /api/results/{result_id} { included_report: true }
POST /api/runs/{run_id}/report { failure_cases_per_category: 3 } → job_id
GET /api/runs/report-jobs (poll until status == "ready")
GET /api/runs/{run_id}/report/{job_id} → PDF
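
Steps 1, 3, and 4 of the flow can be sketched as a thin Python client. Everything below is a sketch under stated assumptions: `RedClient` and `create_and_start_run` are hypothetical names, the login body is assumed to be `{"username", "password"}`, the JSON response keys (`access_token`, `evaluation_id`, `run_id`) are assumed to match the names used in the flow, and the base URL is assumed to serve both /auth-api and /api. Check the actual request and response shapes before relying on it.

```python
import json
import urllib.request


class RedClient:
    """Minimal JSON-over-HTTP client with a pluggable transport."""

    def __init__(self, base_url, send=None):
        self.base_url = base_url.rstrip("/")
        self.token = None
        self._send = send or self._http_send

    def _http_send(self, method, url, headers, body):
        # Default transport: urllib from the standard library.
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(url, data=data, method=method, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read() or b"{}")

    def call(self, method, path, body=None):
        headers = {"Content-Type": "application/json"}
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"
        return self._send(method, self.base_url + path, headers, body)

    def login(self, username, password):
        # Assumed request body and response key.
        reply = self.call("POST", "/auth-api/api/v1/auth/login",
                          {"username": username, "password": password})
        self.token = reply["access_token"]


def create_and_start_run(client, name, agent_endpoint, domains):
    """Create an evaluation, materialize a run, start it; return the run ID."""
    evaluation = client.call("POST", "/api/evaluations", {
        "name": name,
        "agent_endpoint": agent_endpoint,
        "domains": domains,
    })
    run = client.call("POST", "/api/runs",
                      {"evaluation_id": evaluation["evaluation_id"]})
    client.call("POST", f"/api/runs/{run['run_id']}/start")
    return run["run_id"]
```

The `send` parameter exists so the request sequence can be checked with a stub transport before pointing the client at a live deployment.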