VirtueGuard-Audio Setup Guide
Introduction
VirtueGuard-Audio is our audio safety classifier. It analyzes speech and non-speech audio to detect abusive language, threats, harmful instructions, sensitive topics, and signals of voice-clone or TTS-based misuse — without requiring you to run a separate speech-to-text stage first.
VirtueGuard-Audio is part of the VirtueGuard family and shares its API style with TextGuard Lite, ImageGuard, and VideoGuard.
Risk Categories
VirtueGuard-Audio jointly evaluates the spoken content and the acoustic properties of the clip:
| Category | Description |
|---|---|
| Abusive_Language | Profanity, slurs, or harassing speech |
| Threats_Intimidation | Direct or implied threats of harm |
| Hate_Speech | Discriminatory or extremist speech |
| Sexual_Content | Sexual speech or vocalizations |
| Self_Harm | Speech promoting or describing self-injury or suicide |
| Criminal_Planning | Spoken instructions facilitating illegal activity |
| Specialized_Harmful_Advice | Dangerous medical, legal, or operational advice |
| Privacy_PII_Disclosure | Spoken disclosure of personal or sensitive information |
| Voice_Clone_Synthetic | Indicators that the speech is AI-generated or cloned |
| Disturbing_Audio | Screams, gunshots, distress signals, or other unsafe non-speech audio |
Categories can be customized per deployment.
API Integration
Authentication and Endpoint
All API requests require an API key included in the request headers.
Endpoint: https://api.virtueai.io/api/audiomoderation
Method: POST
Headers:
Content-Type: application/json
Authorization: Bearer your_api_key_here
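A minimal sketch of building these headers without hard-coding the key. The environment variable name VIRTUEGUARD_API_KEY and the helper build_headers are assumptions made for illustration; they are not part of the documented API.

```python
import os


def build_headers(env_var: str = "VIRTUEGUARD_API_KEY") -> dict:
    """Return request headers, reading the API key from the environment.

    The variable name is a hypothetical choice for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

Keeping the key in the environment keeps it out of source control and lets each deployment use its own credentials.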
Input formats
VirtueGuard-Audio accepts:
- A URL pointing to the audio (mp3, wav, ogg, flac, m4a).
- A base64-encoded audio payload for short clips (≤ 60 seconds recommended).
- A live audio stream (WebSocket PCM, RTP, or SIP/Twilio media stream) for real-time moderation. See Streaming Input below.
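The first two input formats can be sketched as a single helper that picks the right field. The field names audio_url, audio_b64, language, and chunk_seconds are taken from the request example in this guide; the helper build_payload itself is hypothetical, and whether chunk_seconds is accepted alongside audio_b64 is an assumption.

```python
import base64
from pathlib import Path
from typing import Optional


def build_payload(source: str, language: str = "en",
                  chunk_seconds: Optional[int] = None) -> dict:
    """Build a request body from a URL or a local file path.

    URLs go in `audio_url`; local files are base64-encoded into `audio_b64`.
    """
    if source.startswith(("http://", "https://")):
        payload = {"audio_url": source}
    else:
        raw = Path(source).read_bytes()
        payload = {"audio_b64": base64.b64encode(raw).decode("ascii")}
    payload["language"] = language
    if chunk_seconds is not None:
        # Only send the override when the caller asked for one.
        payload["chunk_seconds"] = chunk_seconds
    return payload
```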
For long audio, the service segments the input into windowed chunks; you can override the chunk size per request.
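To get a feel for how chunk_seconds affects the number of segments, here is a rough sketch that splits a clip into consecutive windows. It assumes non-overlapping windows aligned to the start of the clip, which may not match the service: the response example in this guide shows a segment spanning 4.0 to 14.0 seconds, so treat this as an estimate only.

```python
def chunk_windows(duration_seconds: float, chunk_seconds: float) -> list:
    """Split [0, duration] into consecutive non-overlapping windows.

    Assumption: windows do not overlap and the last one may be shorter.
    The real service may segment differently.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        windows.append((start, end))
        start = end
    return windows


print(chunk_windows(28.7, 10))  # → [(0.0, 10.0), (10.0, 20.0), (20.0, 28.7)]
```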
API Request Example
Python:
import requests
import base64

url = "https://api.virtueai.io/api/audiomoderation"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_api_key_here",
}

# Option A: pass a URL
payload = {
    "audio_url": "https://example.com/call.mp3",
    "language": "en",
    "chunk_seconds": 10,
}

# Option B: pass base64-encoded bytes
# with open("call.mp3", "rb") as f:
#     payload = {"audio_b64": base64.b64encode(f.read()).decode(), "language": "en"}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

cURL:
curl -X POST "https://api.virtueai.io/api/audiomoderation" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/call.mp3",
    "language": "en",
    "chunk_seconds": 10
  }'
Response
{
  "flag": true,
  "duration_seconds": 28.7,
  "language": "en",
  "summary": {
    "categories": {
      "Threats_Intimidation": true,
      "Voice_Clone_Synthetic": true,
      "Hate_Speech": false
    },
    "probs": {
      "Threats_Intimidation": 0.88,
      "Voice_Clone_Synthetic": 0.74,
      "Hate_Speech": 0.06
    }
  },
  "segments": [
    {
      "start_seconds": 4.0,
      "end_seconds": 14.0,
      "transcript": "...",
      "categories": { "Threats_Intimidation": true },
      "probs": { "Threats_Intimidation": 0.93 }
    }
  ],
  "latency_ms": 412
}
| Field | Description |
|---|---|
| flag | true if any segment flagged the content. |
| summary.categories / summary.probs | Worst-case aggregated decision and probabilities across the clip. |
| segments[] | Per-chunk decisions with optional transcripts. |
| language | Detected language (or the one you passed in). |
| latency_ms | End-to-end classification latency. |
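A short sketch of reading these fields on the client. The helpers flagged_categories and flagged_segments are hypothetical, and the optional threshold is a client-side choice for this example rather than a documented API feature; the sample dict is an abbreviated copy of the response shown above.

```python
from typing import Optional


def flagged_categories(result: dict, threshold: Optional[float] = None) -> list:
    """Return the clip-level categories that are flagged.

    Without a threshold, trust the booleans in summary.categories.
    With one, re-derive the decision from summary.probs.
    """
    summary = result.get("summary", {})
    if threshold is None:
        return sorted(k for k, hit in summary.get("categories", {}).items() if hit)
    return sorted(k for k, p in summary.get("probs", {}).items() if p >= threshold)


def flagged_segments(result: dict) -> list:
    """Return (start, end, categories) for each segment with a flagged category."""
    out = []
    for seg in result.get("segments", []):
        hits = sorted(k for k, v in seg.get("categories", {}).items() if v)
        if hits:
            out.append((seg["start_seconds"], seg["end_seconds"], hits))
    return out


sample = {
    "flag": True,
    "summary": {
        "categories": {"Threats_Intimidation": True,
                       "Voice_Clone_Synthetic": True,
                       "Hate_Speech": False},
        "probs": {"Threats_Intimidation": 0.88,
                  "Voice_Clone_Synthetic": 0.74,
                  "Hate_Speech": 0.06},
    },
    "segments": [{"start_seconds": 4.0, "end_seconds": 14.0,
                  "categories": {"Threats_Intimidation": True},
                  "probs": {"Threats_Intimidation": 0.93}}],
}

print(flagged_categories(sample))       # → ['Threats_Intimidation', 'Voice_Clone_Synthetic']
print(flagged_categories(sample, 0.8))  # → ['Threats_Intimidation']
print(flagged_segments(sample))         # → [(4.0, 14.0, ['Threats_Intimidation'])]
```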
Streaming Input
VirtueGuard-Audio also accepts live audio streams and returns moderation events incrementally — useful for call centers, voice agents, live moderation, and SIP/Twilio media-stream pipelines.
Supported transports
| Transport | Use Case |
|---|---|
| WebSocket (PCM) | Push 16-bit PCM (or μ-law) frames over a WebSocket. Best fit for browser and custom clients. |
| Twilio Media Streams | Bidirectional WebSocket payloads in Twilio's media envelope — drop-in for <Connect><Stream>. |
| SIP / RTP | Connect to a SIPREC fork or an RTP feed for telephony pipelines. |
| gRPC bidi stream | Server-to-server low-latency integration. |
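For the WebSocket (PCM) transport, a client has to turn samples into 16-bit PCM frames before sending them. The sketch below shows one way to do that with the standard library. Little-endian byte order, mono audio, and the 20 ms frame size are assumptions for illustration; this guide does not specify them.

```python
import struct


def float_to_pcm16(samples: list) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20) -> list:
    """Split PCM16 mono bytes into fixed-duration frames; the last may be short."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

At 16 kHz, a 20 ms frame is 320 samples, or 640 bytes; each frame would then be sent as one binary WebSocket message.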
Endpoint
Send POST /api/audiomoderation/streams to create a session, then push audio to the returned WebSocket URL (or have your telephony platform connect to it).
curl -X POST "https://api.virtueai.io/api/audiomoderation/streams" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "transport": "websocket-pcm",
    "sample_rate": 16000,
    "encoding": "pcm16",
    "language": "en",
    "chunk_seconds": 5,
    "callback_url": "https://your-app.example.com/policyguard-events"
  }'
Response:
{
  "stream_id": "as_01HX...",
  "status": "running",
  "websocket_url": "wss://api.virtueai.io/api/audiomoderation/streams/as_01HX.../audio"
}
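The same session-creation call can be sketched in Python. This version uses the standard library's urllib.request so it runs without extra dependencies; the helper names are hypothetical, and the default body simply mirrors the cURL example above.

```python
import json
import urllib.request

STREAMS_URL = "https://api.virtueai.io/api/audiomoderation/streams"


def build_create_stream_request(api_key: str, **options) -> urllib.request.Request:
    """Prepare (but do not send) the session-creation request."""
    body = {
        "transport": "websocket-pcm",
        "sample_rate": 16000,
        "encoding": "pcm16",
        "language": "en",
        "chunk_seconds": 5,
    }
    body.update(options)  # e.g. callback_url="https://..."
    return urllib.request.Request(
        STREAMS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_stream(api_key: str, **options) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_create_stream_request(api_key, **options)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Separating request construction from sending makes the body easy to inspect or log before any network call is made.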
Minimal browser client
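// websocket_url comes from the session-creation response;
// mediaRecorder is assumed to be an already-configured MediaRecorder.
// MediaRecorder emits container-encoded chunks, so make sure the session's
// encoding matches what you send (capture raw samples if you need pcm16).
const ws = new WebSocket(websocket_url);
ws.binaryType = "arraybuffer";
mediaRecorder.ondataavailable = (e) => ws.send(e.data);
ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  if (result.flag) {
    const hits = Object.keys(result.categories).filter((k) => result.categories[k]);
    console.warn(`Blocked at ${result.start_seconds}s: ${hits.join(", ")}`);
    mediaRecorder.stop();
  }
};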
Event stream
Each moderation decision is emitted as a JSON event on the same WebSocket
(or POSTed to your callback_url):
{
  "stream_id": "as_01HX...",
  "type": "segment",
  "start_seconds": 4.0,
  "end_seconds": 9.0,
  "transcript": "...",
  "flag": true,
  "categories": { "Threats_Intimidation": true },
  "probs": { "Threats_Intimidation": 0.93 },
  "latency_ms": 84
}
type is one of segment, summary, or error. Segment events fire at
the end of each chunk window; a summary event is emitted when you close
the stream.
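A minimal sketch of consuming these events on the receiving side, whether they arrive over the WebSocket or at your callback endpoint. The StreamMonitor class is hypothetical. It keeps a running maximum probability per category, which mirrors the "worst-case" wording used for the clip summary but is computed client-side here. The fields of error events are not described in this guide, so the sketch surfaces the raw event.

```python
import json


class StreamMonitor:
    """Track segment events and keep the highest probability seen per category."""

    def __init__(self):
        self.worst_probs = {}
        self.flagged_spans = []
        self.closed = False

    def handle(self, raw: str) -> str:
        """Process one JSON event and return its type."""
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "segment":
            for name, p in event.get("probs", {}).items():
                self.worst_probs[name] = max(p, self.worst_probs.get(name, 0.0))
            if event.get("flag"):
                self.flagged_spans.append(
                    (event["start_seconds"], event["end_seconds"])
                )
        elif kind == "summary":
            # Emitted once when the stream is closed.
            self.closed = True
        elif kind == "error":
            raise RuntimeError(f"stream error: {event}")
        else:
            raise ValueError(f"unexpected event type: {kind!r}")
        return kind
```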
Stop a session with DELETE /api/audiomoderation/streams/{stream_id}.
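As a sketch, the stop call could be prepared in Python like this. The helper build_stop_request is hypothetical; it sends the same Authorization header as other requests, since all API requests require an API key.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.virtueai.io/api/audiomoderation/streams"


def build_stop_request(api_key: str, stream_id: str) -> urllib.request.Request:
    """Prepare (but do not send) the DELETE request that ends a session."""
    url = f"{BASE_URL}/{urllib.parse.quote(stream_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="DELETE",
    )
```

Pass the returned object to urllib.request.urlopen to actually send it.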
Common Integration Patterns
- Call center / voice agent: stream live agent and caller audio over WebSocket or Twilio Media Streams; receive segment events in real time and route flagged calls to a human reviewer.
- Voice assistant: check synthesized TTS output before playback; block responses that violate policy.
- Voice clone / fraud detection: combine the Voice_Clone_Synthetic signal with your KYC or anti-fraud pipeline.
- Podcast / UGC moderation: scan uploads or generated audio before publishing.
- Multimodal combo: pair with VideoGuard for full audio + video moderation, and PolicyGuard to express the block/allow rules declaratively.