VirtueGuard-Audio Setup Guide

Introduction

VirtueGuard-Audio is our audio safety classifier. It analyzes speech and non-speech audio to detect abusive language, threats, harmful instructions, sensitive topics, and signals of voice-clone or TTS-based misuse — without requiring you to run a separate speech-to-text stage first.

VirtueGuard-Audio is part of the VirtueGuard family and shares its API style with TextGuard Lite, ImageGuard, and VideoGuard.

Risk Categories

VirtueGuard-Audio jointly evaluates the spoken content and the acoustic properties of the clip:

Category	Description
Abusive_Language	Profanity, slurs, or harassing speech
Threats_Intimidation	Direct or implied threats of harm
Hate_Speech	Discriminatory or extremist speech
Sexual_Content	Sexual speech or vocalizations
Self_Harm	Speech promoting or describing self-injury or suicide
Criminal_Planning	Spoken instructions facilitating illegal activity
Specialized_Harmful_Advice	Dangerous medical, legal, or operational advice
Privacy_PII_Disclosure	Spoken disclosure of personal or sensitive information
Voice_Clone_Synthetic	Indicators that the speech is AI-generated or cloned
Disturbing_Audio	Screams, gunshots, distress signals, or other unsafe non-speech audio

Categories can be customized per deployment.

API Integration

Authentication and Endpoint

All API requests require an API key included in the request headers.

Endpoint: https://api.virtueai.io/api/audiomoderation

Method: POST

Headers:

Content-Type: application/json
Authorization: Bearer your_api_key_here
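
To keep the key out of source code, the headers above can be built from an environment variable. This is a minimal sketch: the variable name VIRTUEAI_API_KEY and the helper build_headers are assumptions for illustration, not names defined by the service.

```python
import os


def build_headers(env_var: str = "VIRTUEAI_API_KEY") -> dict:
    """Build the request headers shown above from an environment variable.

    The variable name is hypothetical; use whatever your deployment provides.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your VirtueGuard API key")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

Failing early on a missing key avoids sending a request that can only be rejected.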

Input formats

VirtueGuard-Audio accepts:

A URL pointing to the audio (mp3, wav, ogg, flac, m4a).
A base64-encoded audio payload for short clips (≤ 60 seconds recommended).
A live audio stream (WebSocket PCM, RTP, or SIP/Twilio media stream) for real-time moderation. See Streaming Input below.
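
For the first two input types, the request body differs only in which field carries the audio. The sketch below picks the field from the source string. The field names audio_url and audio_b64 are the ones used in this guide's request example; the helper build_payload is hypothetical.

```python
import base64
from pathlib import Path
from urllib.parse import urlparse


def build_payload(source: str, language: str = "en") -> dict:
    """Return a request body for either a remote URL or a local audio file."""
    if urlparse(source).scheme in ("http", "https"):
        # Remote audio: let the service fetch it.
        return {"audio_url": source, "language": language}
    # Local file: read the bytes and base64-encode them as ASCII text.
    data = Path(source).read_bytes()
    return {
        "audio_b64": base64.b64encode(data).decode("ascii"),
        "language": language,
    }
```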

For long audio, the service segments the input into windowed chunks; you can override the chunk size per request.
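
To estimate how many segments a clip will produce for a given chunk size, the windowing can be approximated as follows. This sketch assumes consecutive, non-overlapping windows that start at 0, with a shorter final window. The service's actual windowing (alignment, overlap) is not specified in this guide, and chunk_windows is a hypothetical helper.

```python
def chunk_windows(duration_seconds: float, chunk_seconds: float) -> list:
    """Split [0, duration] into consecutive, non-overlapping windows.

    Assumption: fixed-size windows starting at 0 with a shorter final window.
    """
    if duration_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("duration_seconds and chunk_seconds must be positive")
    windows = []
    index = 0
    while index * chunk_seconds < duration_seconds:
        start = index * chunk_seconds
        end = min(start + chunk_seconds, duration_seconds)
        windows.append((start, end))
        index += 1
    return windows
```

Under these assumptions, a 28.7-second clip with chunk_seconds set to 10 yields three windows, the last one covering 20 to 28.7 seconds.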

API Request Example

Python

import requests
import base64

url = "https://api.virtueai.io/api/audiomoderation"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_api_key_here",
}

# Option A: pass a URL
payload = {
    "audio_url": "https://example.com/call.mp3",
    "language": "en",
    "chunk_seconds": 10,
}

# Option B: pass base64-encoded bytes (uncomment to use instead of Option A)
# with open("call.mp3", "rb") as f:
#     payload = {"audio_b64": base64.b64encode(f.read()).decode(), "language": "en"}

# Set a timeout so a stalled connection does not hang the caller.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
print(response.json())

cURL

curl -X POST "https://api.virtueai.io/api/audiomoderation" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.mp3",
    "language": "en",
    "chunk_seconds": 10
  }'

Response

{
  "flag": true,
  "duration_seconds": 28.7,
  "language": "en",
  "summary": {
    "categories": {
      "Threats_Intimidation": true,
      "Voice_Clone_Synthetic": true,
      "Hate_Speech": false
    },
    "probs": {
      "Threats_Intimidation": 0.88,
      "Voice_Clone_Synthetic": 0.74,
      "Hate_Speech": 0.06
    }
  },
  "segments": [
    {
      "start_seconds": 4.0,
      "end_seconds": 14.0,
      "transcript": "...",
      "categories": { "Threats_Intimidation": true },
      "probs":      { "Threats_Intimidation": 0.93 }
    }
  ],
  "latency_ms": 412
}

Field	Description
`flag`	`true` if at least one segment was flagged.
`summary.categories` / `summary.probs`	Worst-case aggregated decision and probabilities across the clip.
`segments[]`	Per-chunk decisions with optional transcripts.
`language`	Detected language (or the one you passed in).
`latency_ms`	End-to-end classification latency.
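
The fields above can be read with a few small helpers. This is a sketch, assuming the response shape shown in the example. The function names are hypothetical, and over_threshold applies a caller-chosen, client-side policy to summary.probs rather than any documented service feature.

```python
def flagged_categories(response: dict) -> list:
    """Names of categories marked true in the clip-level summary."""
    categories = response.get("summary", {}).get("categories", {})
    return sorted(name for name, hit in categories.items() if hit)


def flagged_segments(response: dict) -> list:
    """(start, end, categories) for every segment with at least one hit."""
    hits = []
    for seg in response.get("segments", []):
        names = sorted(n for n, hit in seg.get("categories", {}).items() if hit)
        if names:
            hits.append((seg["start_seconds"], seg["end_seconds"], names))
    return hits


def over_threshold(response: dict, thresholds: dict, default: float = 0.5) -> list:
    """Categories whose summary probability meets a caller-chosen threshold.

    The thresholds and the 0.5 default are illustrative values.
    """
    probs = response.get("summary", {}).get("probs", {})
    return sorted(n for n, p in probs.items() if p >= thresholds.get(n, default))
```

Applied to the example response, flagged_categories returns Threats_Intimidation and Voice_Clone_Synthetic, and flagged_segments returns the single segment from 4.0 to 14.0 seconds.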

Streaming Input

VirtueGuard-Audio also accepts live audio streams and returns moderation events incrementally — useful for call centers, voice agents, live moderation, and SIP/Twilio media-stream pipelines.

Supported transports

Transport	Use Case
WebSocket (PCM)	Push 16-bit PCM (or μ-law) frames over a WebSocket. Best fit for browser and custom clients.
Twilio Media Streams	Bidirectional WebSocket payloads in Twilio's `media` envelope — drop-in for `<Connect><Stream>`.
SIP / RTP	Connect to a SIPREC fork or an RTP feed for telephony pipelines.
gRPC bidi stream	Server-to-server low-latency integration.
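
For the WebSocket (PCM) transport, raw audio is typically sent in small fixed-duration frames. The sketch below shows the frame-size arithmetic for 16-bit PCM and splits a buffer accordingly. The 20 ms frame duration is an assumption, since this guide does not state a required frame size, and both helper names are hypothetical.

```python
def pcm16_frame_bytes(sample_rate: int, frame_ms: int, channels: int = 1) -> int:
    """Bytes in one frame of 16-bit PCM: samples per frame * 2 bytes * channels."""
    return sample_rate * frame_ms // 1000 * 2 * channels


def split_pcm16(audio: bytes, sample_rate: int, frame_ms: int = 20):
    """Yield fixed-size mono frames; the final frame may be shorter."""
    size = pcm16_frame_bytes(sample_rate, frame_ms)
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

At 16000 Hz, a 20 ms mono frame holds 320 samples, or 640 bytes.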

Endpoint

Send a POST request to /api/audiomoderation/streams to create a session, then push audio to the returned WebSocket URL (or have your telephony platform connect to it).

curl -X POST "https://api.virtueai.io/api/audiomoderation/streams" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "transport": "websocket-pcm",
    "sample_rate": 16000,
    "encoding": "pcm16",
    "language": "en",
    "chunk_seconds": 5,
    "callback_url": "https://your-app.example.com/policyguard-events"
  }'

Response:

{
  "stream_id": "as_01HX...",
  "status": "running",
  "websocket_url": "wss://api.virtueai.io/api/audiomoderation/streams/as_01HX.../audio"
}
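
The same session-creation call can be sketched in Python using only the standard library. The request fields mirror the cURL example above. The helper names build_stream_request and create_stream are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request

STREAMS_URL = "https://api.virtueai.io/api/audiomoderation/streams"


def build_stream_request(api_key: str, **overrides) -> urllib.request.Request:
    """Prepare the session-creation POST with the fields from the cURL example."""
    body = {
        "transport": "websocket-pcm",
        "sample_rate": 16000,
        "encoding": "pcm16",
        "language": "en",
        "chunk_seconds": 5,
    }
    body.update(overrides)  # for example callback_url="https://..."
    return urllib.request.Request(
        STREAMS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_stream(api_key: str, **overrides) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_stream_request(api_key, **overrides)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Separating request construction from sending makes the body easy to inspect or unit-test without network access.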

Minimal browser client

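// Assumes websocket_url was returned by the session-creation call above and
// mediaRecorder is an existing MediaRecorder attached to the microphone.
// Note: MediaRecorder emits encoded chunks (for example WebM/Opus), not raw PCM;
// for a pcm16 session, capture raw samples (for example with an AudioWorklet).
const ws = new WebSocket(websocket_url);
ws.binaryType = "arraybuffer";
mediaRecorder.ondataavailable = (e) => {
  if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  if (result.type === "segment" && result.flag) {
    const flagged = Object.keys(result.categories).filter((k) => result.categories[k]);
    console.warn(`Blocked at ${result.start_seconds}s: ${flagged.join(", ")}`);
    mediaRecorder.stop();
  }
};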

Event stream

Each moderation decision is emitted as a JSON event on the same WebSocket (or POSTed to your callback_url):

{
  "stream_id": "as_01HX...",
  "type": "segment",
  "start_seconds": 4.0,
  "end_seconds": 9.0,
  "transcript": "...",
  "flag": true,
  "categories": { "Threats_Intimidation": true },
  "probs":      { "Threats_Intimidation": 0.93 },
  "latency_ms": 84
}

type is one of segment, summary, or error. Segment events fire at the end of each chunk window; a summary event is emitted when you close the stream.
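
A consumer usually branches on the type field first. The sketch below maps each event to an action. The three type values come from this guide; the action names ("block", "allow", "done", "error") and the helper handle_event are illustrative assumptions.

```python
import json


def handle_event(raw: str) -> str:
    """Map one JSON event to an action: 'block', 'allow', 'done', or 'error'."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "segment":
        # Segment events carry the per-chunk decision in `flag`.
        return "block" if event.get("flag") else "allow"
    if kind == "summary":
        # Emitted once, when the stream is closed.
        return "done"
    if kind == "error":
        return "error"
    raise ValueError(f"Unexpected event type: {kind!r}")
```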

Stop a session with DELETE /api/audiomoderation/streams/{stream_id}.
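
Stopping a session from Python could look like the sketch below. It assumes the DELETE call takes the same Authorization header as the other endpoints, in line with the statement that all API requests require an API key. The helper name build_stop_request is hypothetical.

```python
import urllib.request
from urllib.parse import quote

STREAMS_URL = "https://api.virtueai.io/api/audiomoderation/streams"


def build_stop_request(api_key: str, stream_id: str) -> urllib.request.Request:
    """Prepare DELETE /api/audiomoderation/streams/{stream_id}."""
    # Percent-encode the ID so it is always a single path segment.
    url = f"{STREAMS_URL}/{quote(stream_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="DELETE",
    )
```

Pass the result to urllib.request.urlopen to send it. Doing this in a finally block ensures the session is closed even if the audio producer fails.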

Common Integration Patterns

Call center / voice agent: stream live agent and caller audio over WebSocket or Twilio Media Streams, receive segment events in real time, and route flagged calls to a human reviewer.
Voice assistant: check synthesized TTS output before playback and block responses that violate policy.
Voice clone / fraud detection: combine the Voice_Clone_Synthetic signal with your KYC or anti-fraud pipeline.
Podcast / UGC moderation: scan uploads or generated audio before publishing.
Multimodal combo: pair with VideoGuard for full audio and video moderation, and with PolicyGuard to express the block/allow rules declaratively.
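
As an illustration of the voice assistant pattern, the sketch below gates synthesized audio on the moderation result. The moderate callable stands in for whatever client code posts the audio and returns the parsed response. It and gate_tts are hypothetical names, and treating a missing flag field as unsafe is a design assumption.

```python
from typing import Callable, Optional


def gate_tts(audio: bytes, moderate: Callable[[bytes], dict]) -> Optional[bytes]:
    """Return the audio only if the moderation result is not flagged."""
    result = moderate(audio)
    # Fail closed: a response without a `flag` field is treated as unsafe.
    if result.get("flag", True):
        return None
    return audio
```

A caller would play the returned bytes, or fall back to a safe canned response when the function returns None.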
