On-Prem Deployment
Deploy Virtue AgentSuite-Red on-premises with Docker Compose. AgentSuite-Red is an automated red-teaming platform for AI agents. It runs adversarial evaluations against any HTTP-accessible agent and exercises the agent through Virtue Agent ForgingGround, a pool of containerized environment sandboxes (Salesforce, Gmail, Slack, Atlassian, GitLab, BigQuery, etc.).
Prerequisites
| Requirement | Details |
|---|---|
| Operating System | Ubuntu 22.04 LTS (or RHEL 9), x86_64 |
| Docker | Docker Engine 24+ with Compose v2 plugin |
| Python | 3.11+ (only on the host that runs gen_compose) |
| uv | Latest — docs.astral.sh/uv |
| Git | 2.40+ with submodule support |
| CPU | 12 cores minimum (16+ recommended) |
| RAM | 24 GB minimum (32 GB recommended) |
| Disk | 500 GB SSD (~30 GB for env images, the rest for trajectories and DB) |
| GPU | Not required |
| Outbound network | Required for first-time image pulls and PyPI dependencies |
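The hardware minimums in the table can be checked before installation. The following is a minimal sketch, assuming a Linux host; the names `check_host` and `probe_host` are illustrative and not part of the bundle. Sizes use decimal gigabytes, and the disk figure is the total size of the filesystem rather than free space.

```python
import os
import shutil

# Minimums taken from the prerequisites table.
MIN_CORES = 12
MIN_RAM_GB = 24
MIN_DISK_GB = 500


def check_host(cores: int, ram_gb: float, disk_gb: float) -> list[str]:
    """Return a list of problems; an empty list means the host meets the minimums."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores, need at least {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f} GB, need at least {MIN_RAM_GB}")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.0f} GB, need at least {MIN_DISK_GB}")
    return problems


def probe_host(path: str = "/") -> tuple[int, float, float]:
    """Read core count, total RAM, and total disk size (POSIX only)."""
    cores = os.cpu_count() or 0
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    disk_gb = shutil.disk_usage(path).total / 1e9
    return cores, ram_gb, disk_gb


# Usage on the target host:
#   for problem in check_host(*probe_host()):
#       print(problem)
```

Reported RAM and disk totals are usually slightly below the nominal hardware size, so a near miss is better treated as a warning than as a hard failure.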
Components
All images are hosted in us-docker.pkg.dev/customer-docker-virtueai/agentsuite-red/. There are two categories.
Core services
| Image | Description |
|---|---|
| agentsuite-red/backend | FastAPI orchestrator: schedules evaluations, runs the in-process MCP proxy, calls the target agent, persists results. Listens on port 38085. |
| agentsuite-red/env-server | Docker-in-Docker env-pool manager: starts/resets sandbox environments and spawns MCP server subprocesses. Listens on port 8091. |
| agentsuite-red/frontend | React dashboard served by nginx. Listens on port 22100; reverse-proxies /api/* and /forgingground/mcp to the backend. |
| postgres:17 | Single backend database (agentsuite_red). |
Sandbox environment pool
The env-server starts a pre-warmed pool of sandboxed applications. Each environment can run multiple instances in parallel. The default pool (defined in env_server/pool.yaml):
| Domain | Environments | Default count (same order as environments) |
|---|---|---|
| CRM | salesforce, gmail | 4, 8 |
| Communication | slack, calendar, zoom, telegram, whatsapp | 4, 4, 4, 4, 8 |
| Code / DevOps | atlassian, terminal, googledocs | 4, 4, 8 |
| Finance | paypal, finance | 4, 4 |
| Customer Service | customer_service | 4 |
| OS | OS-filesystem | 4 |
| Travel | travel-suite | 4 |
| Workflow | bigquery, snowflake, databricks, google-form | 4, 4, 4, 4 |
count is the number of pre-started instances and equals the maximum parallelism for that environment.
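To make the sizing concrete, the sketch below copies the default counts from the table into a plain dictionary and sums them. The dictionary and helper names are illustrative; the authoritative values live in env_server/pool.yaml.

```python
# Default counts copied from the pool table; pool.yaml is the source of truth.
DEFAULT_POOL = {
    "salesforce": 4, "gmail": 8,
    "slack": 4, "calendar": 4, "zoom": 4, "telegram": 4, "whatsapp": 8,
    "atlassian": 4, "terminal": 4, "googledocs": 8,
    "paypal": 4, "finance": 4,
    "customer_service": 4,
    "OS-filesystem": 4,
    "travel-suite": 4,
    "bigquery": 4, "snowflake": 4, "databricks": 4, "google-form": 4,
}


def total_instances(pool: dict[str, int]) -> int:
    """Total number of pre-started sandbox instances across all environments."""
    return sum(pool.values())


def max_parallel(pool: dict[str, int], env: str) -> int:
    """Maximum number of tasks that can use `env` at the same time."""
    return pool.get(env, 0)


print(len(DEFAULT_POOL), total_instances(DEFAULT_POOL))  # 19 88
```

The total counts instances, which may differ from the container count Docker reports if an environment uses more than one container per instance.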
Step 1 — Get the code bundle
Extract the bundle delivered by Virtue AI, which contains the deployment files for all images:
unzip agentsuite-red.zip
cd agentsuite-red
Step 2 — Authenticate with the image registry
If you are pulling pre-built images, authenticate to the Virtue AI registry with the GCP service-account key included in the bundle:
docker login -u _json_key --password-stdin https://us-docker.pkg.dev < serviceaccount.json
If you are building images from source (the default docker-compose.yml does this), skip this step.
Step 3 — Configure deployment values
Copy the environment file template and edit it:
cp .env.example .env
$EDITOR .env
Required settings:
| Variable | What to set |
|---|---|
| AGENTSUITE_DATABASE_URL | Leave at postgresql+asyncpg://postgres:postgres@localhost:5432/agentsuite_red for single-node. Point at an external Postgres for production. |
| AGENTSUITE_PROXY_MCP_URL | The public URL the target agent will use, e.g. https://red.acme.internal/forgingground/mcp. Shown in the UI as the value users paste into their agent config. |
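A quick check that both required variables are set can catch an incomplete .env before the stack starts. This is a simplified sketch: `parse_env` and `missing_required` are hypothetical helpers, and the parser handles only plain KEY=VALUE lines (no `export`, multi-line values, or inline comments).

```python
REQUIRED = ("AGENTSUITE_DATABASE_URL", "AGENTSUITE_PROXY_MCP_URL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [key for key in REQUIRED if not values.get(key)]


# Usage:
#   with open(".env", encoding="utf-8") as fh:
#       print(missing_required(parse_env(fh.read())))
```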
Optional — change if applicable:
| Variable | When to change | Default |
|---|---|---|
| AGENTSUITE_MAX_CONCURRENT_TASKS | Raise once you have confirmed the host has spare CPU and RAM under load. | 5 |
| AGENTSUITE_AGENT_REQUEST_TIMEOUT | Increase if the target agent is slow. | 600.0 (seconds) |
| AGENTSUITE_AUTH_ENABLED | Set true for production. | false |
| AGENTSUITE_VIRTUE_AUTH_URL | OIDC issuer URL when auth is enabled. | empty |
| AGENTSUITE_JWT_SECRET | HS256 secret if using local JWT instead of OIDC. | empty |
If you enable auth, generate a strong JWT secret:
SECRET=$(openssl rand -hex 32) && echo "JWT secret: $SECRET"
Paste it into AGENTSUITE_JWT_SECRET in .env.
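If openssl is not available on the host, the Python standard library can produce an equivalent value. A small sketch; the function name is illustrative:

```python
import secrets


def new_jwt_secret(num_bytes: int = 32) -> str:
    """Return a random secret as hex, matching `openssl rand -hex 32` in length."""
    return secrets.token_hex(num_bytes)


print("JWT secret:", new_jwt_secret())
```

With the default of 32 bytes the result is 64 hexadecimal characters, the same shape as the openssl output.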
Step 4 — Tune the sandbox pool
env_server/pool.yaml controls which environments are pre-started and at what fan-out. The default file is suitable for a PoC. To change parallelism for a specific environment:
# env_server/pool.yaml
pools:
  salesforce:
    count: 8 # was 4; bump to support more parallel CRM tasks
  gmail:
    count: 8
  # ...
Each additional instance costs roughly 50–500 MB of RAM depending on the environment (Salesforce is the heaviest at ~300 MB; Gmail Mailpit ~100 MB).
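As a worked example of that arithmetic, the sketch below estimates the extra RAM for a change in counts. The per-instance figures for Salesforce and Gmail come from the paragraph above; the 500 MB fallback for other environments is an assumption (the upper end of the stated range), and the function name is illustrative.

```python
# Approximate per-instance RAM in MB, from the figures quoted above.
APPROX_MB_PER_INSTANCE = {"salesforce": 300, "gmail": 100}


def extra_ram_mb(
    old: dict[str, int],
    new: dict[str, int],
    per_instance: dict[str, int] = APPROX_MB_PER_INSTANCE,
    default_mb: int = 500,  # assumed worst case for environments without a figure
) -> int:
    """Estimate additional RAM (MB) needed when counts grow from `old` to `new`."""
    total = 0
    for env, new_count in new.items():
        added = max(0, new_count - old.get(env, 0))
        total += added * per_instance.get(env, default_mb)
    return total


# Raising salesforce from 4 to 8 adds 4 instances at ~300 MB each.
print(extra_ram_mb({"salesforce": 4, "gmail": 8}, {"salesforce": 8, "gmail": 8}))  # 1200
```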
After editing pool.yaml, the pool compose file is regenerated automatically by start.sh in Step 5. To regenerate manually:
uv run python -m env_server.gen_compose
Step 5 — Deploy the stack
./start.sh
start.sh performs two actions:
- Runs uv run python -m env_server.gen_compose to regenerate env_server/pool-compose.yml from pool.yaml.
- Runs docker compose up --build -d, which starts PostgreSQL, env-server, backend, frontend, and ~100–140 sandbox environment containers from pool-compose.yml.
Docker pulls ~26 distinct sandbox images and builds three local images. Expect 15–30 minutes on first start. Subsequent restarts complete in 1–2 minutes.
Step 6 — Verify deployment
Check that all four core services are up:
docker compose ps --format "table {{.Service}}\t{{.Status}}" \
| grep -E "^(postgres|env-server|backend|frontend)\b"
Expected:
postgres Up X minutes (healthy)
env-server Up X minutes
backend Up X minutes
frontend Up X minutes
Hit each service's HTTP endpoint:
curl -fsS http://localhost:38085/health # backend
curl -fsS http://localhost:8091/health # env-server
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:22100/ # frontend (200)
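Because the first start can take a long time, it may be convenient to poll the same three endpoints until they all respond. The following is a sketch using only the standard library; `wait_for_services` and its retry parameters are illustrative choices, not part of the product.

```python
import http.client
import time
import urllib.request
from typing import Callable

# Endpoints from the curl commands above.
ENDPOINTS = {
    "backend": "http://localhost:38085/health",
    "env-server": "http://localhost:8091/health",
    "frontend": "http://localhost:22100/",
}


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def wait_for_services(
    endpoints: dict[str, str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, bool]:
    """Probe each endpoint until all succeed or the attempts run out."""
    status = {name: False for name in endpoints}
    for attempt in range(attempts):
        for name, url in endpoints.items():
            if not status[name]:
                status[name] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return status


# Usage: print(wait_for_services(ENDPOINTS))
```

The `probe` and `sleep` parameters are injectable so the retry logic can be exercised without a running stack.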
Local endpoints once everything is up:
| Service | URL |
|---|---|
| Dashboard | http://localhost:22100 |
| Backend API | http://localhost:38085 |
| env-server API | http://localhost:8091 |
| MCP proxy (for the target agent) | http://localhost:22100/forgingground/mcp |
Step 7 — Seed the red-teaming task bank
Populate the database with the bundled red-teaming tasks:
docker compose exec backend uv run python scripts/populate_dt_source_tasks.py
To also load demo data for a quick UI walkthrough:
docker compose exec backend uv run python scripts/populate_demo_data.py
After seeding, refresh the dashboard at http://localhost:22100 — the task bank should appear in the New Scan wizard.
Connect your agent
The agent under test connects to AgentSuite-Red's MCP proxy at the URL configured in AGENTSUITE_PROXY_MCP_URL. The dashboard's New Scan wizard generates a ready-to-paste config snippet for several frameworks:
Claude Code
{
  "mcpServers": {
    "virtue-forgingground": {
      "type": "sse",
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}
Cursor
{
  "mcpServers": {
    "virtue-forgingground": {
      "url": "https://red.acme.internal/forgingground/mcp",
      "headers": { "X-API-Key": "<your-api-key>" }
    }
  }
}
OpenAI Agents SDK (Python)
from agents import Agent
from agents.mcp import MCPServerSse

mcp_server = MCPServerSse(
    params={
        "url": "https://red.acme.internal/forgingground/mcp",
        "headers": {"X-API-Key": "<your-api-key>"},
    },
)
agent = Agent(name="my-agent", mcp_servers=[mcp_server])
Google ADK
from google.adk.tools.mcp_tool import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPConnectionParams

toolset = McpToolset(
    connection_params=StreamableHTTPConnectionParams(
        url="https://red.acme.internal/forgingground/mcp",
        headers={"X-API-Key": "<your-api-key>"},
    ),
)
No code changes are needed beyond the MCP client config. The proxy transparently routes tool calls to the right sandbox environment for the running task. Tool-name prefixing (salesforce_search_contacts, gmail_send_email, …) lets a single connection multiplex across all enabled environments.
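To illustrate how a prefixed tool name can identify its environment, here is a hypothetical sketch of splitting a name such as salesforce_search_contacts into an environment and a tool. It is not the proxy's actual routing code, and how hyphenated environment names map to prefixes is not specified here. Matching the longest environment name first keeps names that contain underscores, such as customer_service, unambiguous.

```python
def split_tool_name(tool_name: str, environments: list[str]) -> tuple[str, str]:
    """Split '<env>_<tool>' into (env, tool) using the list of enabled environments."""
    # Try longer environment names first so 'customer_service_x' is not
    # mistaken for a shorter environment name followed by 'service_x'.
    for env in sorted(environments, key=len, reverse=True):
        prefix = env + "_"
        if tool_name.startswith(prefix) and len(tool_name) > len(prefix):
            return env, tool_name[len(prefix):]
    raise ValueError(f"no enabled environment matches tool {tool_name!r}")


enabled = ["salesforce", "gmail", "customer_service"]
print(split_tool_name("salesforce_search_contacts", enabled))
print(split_tool_name("gmail_send_email", enabled))
```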
For the full agent-side contract (HTTP endpoint shape, session handling), see Connect Your Agent.
Running evaluations
Via the dashboard
- Open http://localhost:22100 and sign in (bootstrap admin if auth is enabled, or proceed if disabled).
- Click New Scan.
- Choose target domains (e.g. CRM, Code, Workflow), risk categories (e.g. data exfiltration, malicious code, privilege escalation), and threat models (direct, indirect).
- Configure your agent endpoint and paste the MCP URL into your agent's config (see above).
- Click Start Scan. The dashboard shows task progress in real time.
For the full UI walkthrough (login → Add Agent → New Scan wizard → results → report), see Run Red-Teaming Scan — the on-prem dashboard behaves identically.
Via the CLI
# Create an evaluation
uv run agentsuite-red evaluation create my-eval
# List available evaluations
uv run agentsuite-red evaluation list
# Run an evaluation
uv run agentsuite-red evaluation run my-eval
# Watch run status
uv run agentsuite-red status <run-id>
# View aggregated stats
uv run agentsuite-red stats <run-id>
Observability
All agent execution traces are recorded in the Trajectories tab on the dashboard. Each task is a separate session with its own session ID. The Trajectories view records:
- User queries (user role): the instruction sent to the agent for this task.
- Agent tool calls (agent role): tool name and full input parameters.
- Tool outputs (tool role): full execution results from the sandbox MCP server.
- Agent responses (agent role): the final response returned to the user.
Click View on any step to see detailed metadata, the judge's verdict, and which policy (if any) was violated.
Each task also generates a JSON trajectory under agentsuite_server/data/trajectories/<run_id>/<session_id>.json containing the same payload the dashboard renders.
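For offline analysis, the per-session files can be enumerated directly from that directory layout. A minimal sketch, assuming it runs from the directory that contains agentsuite_server; the helper names are illustrative, and no assumption is made about the fields inside each JSON file.

```python
import json
from pathlib import Path

# Directory layout described above: <root>/<run_id>/<session_id>.json
TRAJECTORY_ROOT = Path("agentsuite_server/data/trajectories")


def trajectory_files(run_id: str, root: Path = TRAJECTORY_ROOT) -> dict[str, Path]:
    """Map each session ID to its trajectory file for the given run."""
    run_dir = root / run_id
    return {path.stem: path for path in sorted(run_dir.glob("*.json"))}


def load_trajectory(path: Path):
    """Load one trajectory file as parsed JSON."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Usage:
#   for session_id, path in trajectory_files("<run-id>").items():
#       payload = load_trajectory(path)
```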
API reference
All /api/* endpoints require Authorization: Bearer <jwt> (or X-API-Key: <key>); auth is delegated to virtue-auth. Failure responses carry a WWW-Authenticate: Bearer error="..." header whose error value is one of token_expired, invalid_signature, invalid_token, missing_auth, or server_misconfigured, so clients can branch on the failure type.
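A client can read the error code from the WWW-Authenticate header and decide what to do next. The parsing below follows the header shape quoted above; the mapping from codes to actions in `next_action` is a hypothetical client policy, not something the server prescribes.

```python
import re
from typing import Optional

# Error codes listed in this section.
KNOWN_ERRORS = {
    "token_expired",
    "invalid_signature",
    "invalid_token",
    "missing_auth",
    "server_misconfigured",
}

_ERROR_RE = re.compile(r'\berror="([^"]*)"')


def auth_error_code(www_authenticate: str) -> Optional[str]:
    """Extract the error code from a 'Bearer error="..."' header value."""
    if not www_authenticate.strip().lower().startswith("bearer"):
        return None
    match = _ERROR_RE.search(www_authenticate)
    return match.group(1) if match else None


def next_action(code: Optional[str]) -> str:
    """Hypothetical policy: refresh on expiry, log in again on bad or missing credentials."""
    if code == "token_expired":
        return "refresh"
    if code in {"invalid_signature", "invalid_token", "missing_auth"}:
        return "login"
    return "abort"  # server_misconfigured or anything unrecognized


print(next_action(auth_error_code('Bearer error="token_expired"')))  # refresh
```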
The most important endpoints, grouped by responsibility:
Authentication
| API | Method | Purpose |
|---|---|---|
| /auth-api/api/v1/auth/login | POST | Exchange username + password for an access token bound to a tenant. |
| /auth-api/api/v1/auth/refresh | POST | Rotate the access token (refresh tokens last 7 days). |
| /api/health | GET | Liveness check. |
Agent registration
| API | Method | Purpose |
|---|---|---|
| /api/agents | POST | Register an agent endpoint with a friendly name. |
| /api/agents | GET | List all agents in the caller's tenant. |
| /api/agents/{agent_id} | DELETE | Remove an agent registration. |
| /api/agents/test | POST | Probe an agent endpoint for reachability before saving. |
Evaluation lifecycle
| API | Method | Purpose |
|---|---|---|
| /api/metadata/summary | GET | Discover available domains / threat models / risk categories / task types. |
| /api/evaluations | POST | Create an evaluation. Materializes one EvaluationTask per matching dataset task. |
| /api/evaluations | GET | List evaluations for the tenant. |
| /api/evaluations/with-stats | GET | Same list plus run-count and ASR rollups. |
| /api/evaluations/{evaluation_id} | GET | Fetch a single evaluation. |
| /api/evaluations/{evaluation_id}/tasks | GET | List the configured EvaluationTask rows. |
| /api/evaluations/{evaluation_id}/sessions | GET | List runs for this evaluation. |
| /api/evaluations/{evaluation_id} | DELETE | Soft-delete an evaluation. |
Run a red-teaming scan
Runs are async: create, start, then poll status.
| API | Method | Purpose |
|---|---|---|
| /api/runs | POST | Materialize a new Run for an evaluation. |
| /api/runs/{run_id}/start | POST | Kick off task execution via the ForgingGround MCP gateway. |
| /api/runs/{run_id}/cancel | POST | Cancel an in-flight run. |
| /api/runs/{run_id} | GET | Run metadata (config, timestamps, error message). |
| /api/runs/{run_id}/status | GET | Cheap polling endpoint: counts only. |
| /api/sessions/{run_id}/stats | GET | Full stats including per-category / per-domain / per-threat-model breakdowns. |
| /api/runs/bulk-delete | POST | Soft-delete a batch of runs. |
Results & trajectories
| API | Method | Purpose |
|---|---|---|
| /api/results | GET | List task results, filterable by run / eval / domain / threat model / risk category / task type / status / attack success. |
| /api/results/{result_id} | GET | Single task result with full trajectory, judge metadata, agent responses. |
| /api/results/{result_id}/trajectory | GET | Raw trajectory JSON file. |
| /api/results/{result_id} | PATCH | Star/unstar a result for inclusion in the curated report. |
Report generation
PDF reports are generated asynchronously and downloaded once ready.
| API | Method | Purpose |
|---|---|---|
| /api/runs/{run_id}/report | POST | Enqueue a PDF report job. |
| /api/runs/report-jobs | GET | List in-flight report jobs for the tenant. |
| /api/runs/{run_id}/report/{job_id} | GET | Download the rendered PDF (?inline=true for browser preview). |
| /api/runs/reports/{job_id} | GET / DELETE | Fetch / delete a generated report record. |
| /api/evaluations/{evaluation_id}/reports | GET | All reports across all runs of an evaluation. |
Metadata
| API | Method | Purpose |
|---|---|---|
| /api/metadata/risk-categories | GET | Canonical RT taxonomy (RT-1 … RT-9). |
| /api/metadata/threat-models | GET | direct, indirect, etc. |
| /api/metadata/domains | GET | crm, workflow, apple-red, code, customer_service, … |
| /api/metadata/envs | GET | Environments registered with env-server. |
| /api/metadata/mcp-config | GET | MCP server URLs to paste into an agent. |
| /api/metadata/task-facets | GET | Per-domain facet counts for filter UI. |
Typical end-to-end flow
# 1. Authenticate
POST /auth-api/api/v1/auth/login → access_token
# 2. Register the agent (once)
POST /api/agents { name, endpoint } → agent_id
# 3. Build an evaluation
GET /api/metadata/summary
POST /api/evaluations { name, agent_endpoint, domains, risk_categories, ... } → evaluation_id
# 4. Run it
POST /api/runs { evaluation_id } → run_id
POST /api/runs/{run_id}/start
GET /api/runs/{run_id}/status (poll until status == "completed")
# 5. Inspect results
GET /api/sessions/{run_id}/stats (aggregated breakdowns)
GET /api/results?run_id={run_id} (per-task)
GET /api/results/{result_id} (single trajectory)
# 6. (Optional) curate + report
PATCH /api/results/{result_id} { included_report: true }
POST /api/runs/{run_id}/report { failure_cases_per_category: 3 } → job_id
GET /api/runs/report-jobs (poll until status == "ready")
GET /api/runs/{run_id}/report/{job_id} → PDF
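Step 4 of the flow (create, start, poll) can be scripted with the standard library. The sketch below is illustrative rather than an official client: the response keys `run_id` and `status` are assumed from the arrows and the polling condition in the flow above, and `make_sender` and `run_evaluation` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# A transport takes (method, path, body) and returns the decoded JSON response.
Send = Callable[[str, str, Optional[dict]], dict]


def make_sender(base_url: str, token: str) -> Send:
    """Build a transport that sends JSON requests with a Bearer token."""
    def send(method: str, path: str, body: Optional[dict] = None) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=data,
            method=method,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(request, timeout=30) as resp:
            raw = resp.read()
        return json.loads(raw) if raw else {}
    return send


def run_evaluation(
    send: Send,
    evaluation_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Create a run, start it, and poll until its status is 'completed'."""
    run = send("POST", "/api/runs", {"evaluation_id": evaluation_id})
    run_id = run["run_id"]  # assumed response key
    send("POST", f"/api/runs/{run_id}/start", None)
    for _ in range(max_polls):
        status = send("GET", f"/api/runs/{run_id}/status", None)
        if status.get("status") == "completed":  # assumed response key
            return run_id
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not complete after {max_polls} polls")


# Usage:
#   send = make_sender("http://localhost:38085", "<access-token>")
#   run_id = run_evaluation(send, "<evaluation-id>")
```

The loop only recognizes the "completed" status named in the flow. Any failure or cancellation states the API reports are not listed here, so a production client should also stop on those instead of waiting for the timeout.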