On-Prem Deployment

Deploy Virtue AgentSuite-Red on-premise with Docker Compose. AgentSuite-Red is an automated red-teaming platform for AI agents that runs adversarial evaluations against any HTTP-accessible agent, exercising it through Virtue Agent ForgingGround — a pool of containerized environment sandboxes (Salesforce, Gmail, Slack, Atlassian, GitLab, BigQuery, etc.).

Prerequisites

Requirement	Details
Operating System	Ubuntu 22.04 LTS (or RHEL 9), x86_64
Docker	Docker Engine 24+ with Compose v2 plugin
Python	3.11+ (only on the host that runs `gen_compose`)
uv	Latest — docs.astral.sh/uv
Git	2.40+ with submodule support
CPU	12 cores minimum (16+ recommended)
RAM	24 GB minimum (32 GB recommended)
Disk	500 GB SSD (~30 GB for env images, the rest for trajectories and DB)
GPU	Not required
Outbound network	Required for first-time image pulls and PyPI dependencies

Components

All images are hosted in us-docker.pkg.dev/customer-docker-virtueai/agentsuite-red/. There are two categories.

Core services

Image	Description
`agentsuite-red/backend`	FastAPI orchestrator — schedules evaluations, runs the in-process MCP proxy, calls the target agent, persists results. Listens on port `38085`.
`agentsuite-red/env-server`	Docker-in-Docker env-pool manager — starts/resets sandbox environments and spawns MCP server subprocesses. Listens on port `8091`.
`agentsuite-red/frontend`	React dashboard served by nginx. Listens on port `22100`; reverse-proxies `/api/*` and `/forgingground/mcp` to the backend.
`postgres:17`	Single backend database (`agentsuite_red`).

Sandbox environment pool

The env-server starts a pre-warmed pool of sandboxed applications. Each environment can run multiple instances in parallel. The default pool (defined in env_server/pool.yaml):

Domain	Environments	Default count
CRM	salesforce, gmail	4, 8
Communication	slack, calendar, zoom, telegram, whatsapp	4, 4, 4, 4, 8
Code / DevOps	atlassian, terminal, googledocs	4, 4, 8
Finance	paypal, finance	4, 4
Customer Service	customer_service	4
OS	OS-filesystem	4
Travel	travel-suite	4
Workflow	bigquery, snowflake, databricks, google-form	4, 4, 4, 4

count is the number of pre-started instances and equals the maximum parallelism for that environment.

Step 1 — Get the code bundle

Extract the bundle delivered by Virtue AI, which contains the deployment files for all images:

unzip agentsuite-red.zip
cd agentsuite-red

Step 2 — Authenticate with the image registry

If you are pulling pre-built images, authenticate to the Virtue AI registry with the GCP service-account key included in the bundle:

docker login -u _json_key --password-stdin https://us-docker.pkg.dev < serviceaccount.json

If you are building images from source (the default docker-compose.yml does this), skip this step.

Step 3 — Configure deployment values

Copy the environment file template and edit it:

cp .env.example .env
$EDITOR .env

Required settings:

Variable	What to set
`AGENTSUITE_DATABASE_URL`	Leave at `postgresql+asyncpg://postgres:postgres@localhost:5432/agentsuite_red` for single-node. Point at an external Postgres for production.
`AGENTSUITE_PROXY_MCP_URL`	The public URL the target agent will use — e.g. `https://red.acme.internal/forgingground/mcp`. Shown in the UI as the value users paste into their agent config.

Optional — change if applicable:

Variable	When to change	Default
`AGENTSUITE_MAX_CONCURRENT_TASKS`	Raise once the host has observed headroom.	`5`
`AGENTSUITE_AGENT_REQUEST_TIMEOUT`	Increase if the target agent is slow.	`600.0` (seconds)
`AGENTSUITE_AUTH_ENABLED`	Set `true` for production.	`false`
`AGENTSUITE_VIRTUE_AUTH_URL`	OIDC issuer URL when auth is enabled.	empty
`AGENTSUITE_JWT_SECRET`	HS256 secret if using local JWT instead of OIDC.	empty

If you enable auth, generate a strong JWT secret:

SECRET=$(openssl rand -hex 32) && echo "JWT secret: $SECRET"

Paste it into AGENTSUITE_JWT_SECRET in .env.

Step 4 — Tune the sandbox pool

env_server/pool.yaml controls which environments are pre-started and at what fan-out. The default file is suitable for a PoC. To change parallelism for a specific environment:

# env_server/pool.yaml
pools:
  salesforce:
    count: 8        # was 4 — bump to support more parallel CRM tasks
  gmail:
    count: 8
  # ...

Each additional instance costs roughly 50–500 MB of RAM depending on the environment (Salesforce is the heaviest at ~300 MB; Gmail Mailpit ~100 MB).

After editing pool.yaml, the pool compose file is regenerated automatically by start.sh in Step 5. To regenerate manually:

uv run python -m env_server.gen_compose

Step 5 — Deploy the stack

./start.sh

start.sh performs two actions:

Runs uv run python -m env_server.gen_compose to regenerate env_server/pool-compose.yml from pool.yaml.
Runs docker compose up --build -d, which starts PostgreSQL, env-server, backend, frontend, and ~100–140 sandbox environment containers from pool-compose.yml.

First start is slow

Docker pulls ~26 distinct sandbox images and builds three local images. Expect 15–30 minutes on first start. Subsequent restarts complete in 1–2 minutes.

Step 6 — Verify deployment

Check that all four core services are up:

docker compose ps --format "table {{.Service}}\t{{.Status}}" \
  | grep -E "^(postgres|env-server|backend|frontend)\b"

Expected:

postgres     Up X minutes (healthy)
env-server   Up X minutes
backend      Up X minutes
frontend     Up X minutes

Hit each service's HTTP endpoint:

curl -fsS http://localhost:38085/health        # backend
curl -fsS http://localhost:8091/health         # env-server
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:22100/   # frontend (200)

Local endpoints once everything is up:

Service	URL
Dashboard	`http://localhost:22100`
Backend API	`http://localhost:38085`
env-server API	`http://localhost:8091`
MCP proxy (for the target agent)	`http://localhost:22100/forgingground/mcp`

Step 7 — Seed the red-teaming task bank

Populate the database with the bundled red-teaming tasks:

docker compose exec backend uv run python scripts/populate_dt_source_tasks.py

To also load demo data for a quick UI walkthrough:

docker compose exec backend uv run python scripts/populate_demo_data.py

After seeding, refresh the dashboard at http://localhost:22100 — the task bank should appear in the New Scan wizard.

Connect your agent

The agent under test connects to AgentSuite-Red's MCP proxy at the URL configured in AGENTSUITE_PROXY_MCP_URL. The dashboard's New Scan wizard generates a ready-to-paste config snippet for several frameworks:

Claude Code

{
  "mcpServers": {
    "virtue-forgingground": {
      "type": "sse",
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}

Cursor

{
  "mcpServers": {
    "virtue-forgingground": {
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}

OpenAI Agents SDK (Python)

from agents import Agent
from agents.mcp import MCPServerSse

mcp_server = MCPServerSse(
    params={
        "url": "https://red.acme.internal/forgingground/mcp",
        "headers": {"X-API-Key": "<your-api-key>"},
    },
)
agent = Agent(name="my-agent", mcp_servers=[mcp_server])

Google ADK

from google.adk.tools.mcp_tool import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPConnectionParams

toolset = McpToolset(
    connection_params=StreamableHTTPConnectionParams(
        url="https://red.acme.internal/forgingground/mcp",
        headers={"X-API-Key": "<your-api-key>"},
    ),
)

No code changes are needed beyond the MCP client config. The proxy transparently routes tool calls to the right sandbox environment for the running task. Tool-name prefixing (salesforce_search_contacts, gmail_send_email, …) lets a single connection multiplex across all enabled environments.

For the full agent-side contract (HTTP endpoint shape, session handling), see Connect Your Agent.

Running evaluations

Via the dashboard

Open http://localhost:22100 and sign in (bootstrap admin if auth is enabled, or proceed if disabled).
Click New Scan.
Choose target domains (e.g. CRM, Code, Workflow), risk categories (e.g. data exfiltration, malicious code, privilege escalation), and threat models (direct, indirect).
Configure your agent endpoint and paste the MCP URL into your agent's config (see above).
Click Start Scan. The dashboard shows task progress in real time.

For the full UI walkthrough (login → Add Agent → New Scan wizard → results → report), see Run Red-Teaming Scan — the on-prem dashboard behaves identically.

Via the CLI

# Create an evaluation
uv run agentsuite-red evaluation create my-eval

# List available evaluations
uv run agentsuite-red evaluation list

# Run an evaluation
uv run agentsuite-red evaluation run my-eval

# Watch run status
uv run agentsuite-red status <run-id>

# View aggregated stats
uv run agentsuite-red stats <run-id>

Observability

All agent execution traces are recorded in the Trajectories tab on the dashboard. Each task is a separate session with its own session ID. The Trajectories view records:

User queries (user role) — the instruction sent to the agent for this task.
Agent tool calls (agent role) — tool name and full input parameters.
Tool outputs (tool role) — full execution results from the sandbox MCP server.
Agent responses (agent role) — the final response returned to the user.

Click View on any step to see detailed metadata, the judge's verdict, and which policy (if any) was violated.

Each task also generates a JSON trajectory under agentsuite_server/data/trajectories/<run_id>/<session_id>.json containing the same payload the dashboard renders.

API reference

All /api/* endpoints require Authorization: Bearer <jwt> (or X-API-Key: <key>); auth is delegated to virtue-auth. Failure responses carry WWW-Authenticate: Bearer error="..." headers (token_expired, invalid_signature, invalid_token, missing_auth, server_misconfigured) for client routing.

The most important endpoints, grouped by responsibility:

Authentication

API	Method	Purpose
`/auth-api/api/v1/auth/login`	POST	Exchange username + password for an access token bound to a tenant.
`/auth-api/api/v1/auth/refresh`	POST	Rotate the access token (refresh tokens last 7 days).
`/api/health`	GET	Liveness check.

Agent registration

API	Method	Purpose
`/api/agents`	POST	Register an agent endpoint with a friendly name.
`/api/agents`	GET	List all agents in the caller's tenant.
`/api/agents/{agent_id}`	DELETE	Remove an agent registration.
`/api/agents/test`	POST	Probe an agent endpoint for reachability before saving.

Evaluation lifecycle

API	Method	Purpose
`/api/metadata/summary`	GET	Discover available domains / threat models / risk categories / task types.
`/api/evaluations`	POST	Create an evaluation. Materializes one `EvaluationTask` per matching dataset task.
`/api/evaluations`	GET	List evaluations for the tenant.
`/api/evaluations/with-stats`	GET	Same list plus run-count and ASR rollups.
`/api/evaluations/{evaluation_id}`	GET	Fetch a single evaluation.
`/api/evaluations/{evaluation_id}/tasks`	GET	List the configured `EvaluationTask` rows.
`/api/evaluations/{evaluation_id}/sessions`	GET	List runs for this evaluation.
`/api/evaluations/{evaluation_id}`	DELETE	Soft-delete an evaluation.

Run a red-teaming scan

Runs are async: create, start, then poll status.

API	Method	Purpose
`/api/runs`	POST	Materialize a new Run for an evaluation.
`/api/runs/{run_id}/start`	POST	Kick off task execution via the ForgingGround MCP gateway.
`/api/runs/{run_id}/cancel`	POST	Cancel an in-flight run.
`/api/runs/{run_id}`	GET	Run metadata (config, timestamps, error message).
`/api/runs/{run_id}/status`	GET	Cheap polling endpoint — counts only.
`/api/sessions/{run_id}/stats`	GET	Full stats including per-category / per-domain / per-threat-model breakdowns.
`/api/runs/bulk-delete`	POST	Soft-delete a batch of runs.

Results & trajectories

API	Method	Purpose
`/api/results`	GET	List task results, filterable by run / eval / domain / threat model / risk category / task type / status / attack success.
`/api/results/{result_id}`	GET	Single task result with full trajectory, judge metadata, agent responses.
`/api/results/{result_id}/trajectory`	GET	Raw trajectory JSON file.
`/api/results/{result_id}`	PATCH	Star/unstar a result for inclusion in the curated report.

Report generation

PDF reports are generated asynchronously and downloaded once ready.

API	Method	Purpose
`/api/runs/{run_id}/report`	POST	Enqueue a PDF report job.
`/api/runs/report-jobs`	GET	List in-flight report jobs for the tenant.
`/api/runs/{run_id}/report/{job_id}`	GET	Download the rendered PDF (`?inline=true` for browser preview).
`/api/runs/reports/{job_id}`	GET / DELETE	Fetch / delete a generated report record.
`/api/evaluations/{evaluation_id}/reports`	GET	All reports across all runs of an evaluation.

Metadata

API	Method	Purpose
`/api/metadata/risk-categories`	GET	Canonical RT taxonomy (RT-1 … RT-9).
`/api/metadata/threat-models`	GET	`direct`, `indirect`, etc.
`/api/metadata/domains`	GET	`crm`, `workflow`, `apple-red`, `code`, `customer_service`, …
`/api/metadata/envs`	GET	Environments registered with env-server.
`/api/metadata/mcp-config`	GET	MCP server URLs to paste into an agent.
`/api/metadata/task-facets`	GET	Per-domain facet counts for filter UI.

Typical end-to-end flow

# 1. Authenticate
POST /auth-api/api/v1/auth/login     → access_token

# 2. Register the agent (once)
POST /api/agents                     { name, endpoint } → agent_id

# 3. Build an evaluation
GET  /api/metadata/summary
POST /api/evaluations                { name, agent_endpoint, domains, risk_categories, ... } → evaluation_id

# 4. Run it
POST /api/runs                       { evaluation_id } → run_id
POST /api/runs/{run_id}/start
GET  /api/runs/{run_id}/status       (poll until status == "completed")

# 5. Inspect results
GET  /api/sessions/{run_id}/stats    (aggregated breakdowns)
GET  /api/results?run_id={run_id}    (per-task)
GET  /api/results/{result_id}        (single trajectory)

# 6. (Optional) curate + report
PATCH /api/results/{result_id}       { included_report: true }
POST  /api/runs/{run_id}/report      { failure_cases_per_category: 3 } → job_id
GET   /api/runs/report-jobs          (poll until status == "ready")
GET   /api/runs/{run_id}/report/{job_id}   → PDF

Prerequisites​

Components​

Core services​

Sandbox environment pool​

Step 1 — Get the code bundle​

Step 2 — Authenticate with the image registry​

Step 3 — Configure deployment values​

Step 4 — Tune the sandbox pool​

Step 5 — Deploy the stack​

Step 6 — Verify deployment​

Step 7 — Seed the red-teaming task bank​

Connect your agent​

Claude Code​

Cursor​

OpenAI Agents SDK (Python)​

Google ADK​

Running evaluations​

Via the dashboard​

Via the CLI​

Observability​

API reference​

Authentication​

Agent registration​

Evaluation lifecycle​

Run a red-teaming scan​

Results & trajectories​

Report generation​

Metadata​

Typical end-to-end flow​