VirtueGuard-Audio Setup Guide

Introduction

VirtueGuard-Audio is our audio safety classifier. It analyzes speech and non-speech audio to detect abusive language, threats, harmful instructions, sensitive topics, and signals of voice-clone or TTS-based misuse — without requiring you to run a separate speech-to-text stage first.

VirtueGuard-Audio is part of the VirtueGuard family and shares its API style with TextGuard Lite, ImageGuard, and VideoGuard.


Risk Categories

VirtueGuard-Audio jointly evaluates the spoken content and the acoustic properties of the clip:

Category | Description
--- | ---
Abusive_Language | Profanity, slurs, or harassing speech
Threats_Intimidation | Direct or implied threats of harm
Hate_Speech | Discriminatory or extremist speech
Sexual_Content | Sexual speech or vocalizations
Self_Harm | Speech promoting or describing self-injury or suicide
Criminal_Planning | Spoken instructions facilitating illegal activity
Specialized_Harmful_Advice | Dangerous medical, legal, or operational advice
Privacy_PII_Disclosure | Spoken disclosure of personal or sensitive information
Voice_Clone_Synthetic | Indicators that the speech is AI-generated or cloned
Disturbing_Audio | Screams, gunshots, distress signals, or other unsafe non-speech audio

Categories can be customized per deployment.


API Integration

Authentication and Endpoint

All API requests require an API key included in the request headers.

Endpoint: https://api.virtueai.io/api/audiomoderation

Method: POST

Headers:

  • Content-Type: application/json
  • Authorization: Bearer your_api_key_here

Input formats

VirtueGuard-Audio accepts:

  • A URL pointing to the audio (mp3, wav, ogg, flac, m4a).
  • A base64-encoded audio payload for short clips (≤ 60 seconds recommended).
  • A live audio stream (WebSocket PCM, RTP, or SIP/Twilio media stream) for real-time moderation. See Streaming Input below.

For long audio, the service segments the input into windowed chunks; you can override the chunk size per request with the chunk_seconds field shown in the request example.

API Request Example

import requests
import base64  # only needed for Option B

url = "https://api.virtueai.io/api/audiomoderation"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_api_key_here",
}

# Option A: pass a URL
payload = {
    "audio_url": "https://example.com/call.mp3",
    "language": "en",
    "chunk_seconds": 10,
}

# Option B: pass base64-encoded bytes
# with open("call.mp3", "rb") as f:
#     payload = {"audio_b64": base64.b64encode(f.read()).decode(), "language": "en"}

# Set a timeout and surface HTTP errors instead of printing an error body.
response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
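
Choosing between the two options can be automated from the clip length. The following is a minimal sketch, assuming WAV input (so the standard-library wave module can read the duration) and using the 60-second recommendation above as the cut-off. The helper names wav_duration_seconds and build_payload, and the fallback_url parameter, are hypothetical and not part of the VirtueGuard API; only the audio_url, audio_b64, language, and chunk_seconds fields come from the example above.

```python
import base64
import wave
from typing import Optional

# Recommended ceiling for base64 payloads, taken from the "Input formats" list.
MAX_INLINE_SECONDS = 60.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (frames / frame rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def build_payload(
    path: str,
    fallback_url: Optional[str] = None,  # hypothetical: where the same clip is hosted
    language: str = "en",
    chunk_seconds: Optional[int] = None,
) -> dict:
    """Send short clips inline as base64; refer to longer clips by URL."""
    if wav_duration_seconds(path) <= MAX_INLINE_SECONDS:
        with open(path, "rb") as f:
            payload = {"audio_b64": base64.b64encode(f.read()).decode("ascii")}
    elif fallback_url:
        payload = {"audio_url": fallback_url}
    else:
        raise ValueError("clip exceeds the inline limit and no URL was provided")

    payload["language"] = language
    if chunk_seconds is not None:
        # Only override the service's default windowing when asked to.
        payload["chunk_seconds"] = chunk_seconds
    return payload
```

The returned dictionary can be passed as the json argument of requests.post in the example above. Other formats (mp3, ogg, flac, m4a) would need a different way to measure duration.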

Response

{
  "flag": true,
  "duration_seconds": 28.7,
  "language": "en",
  "summary": {
    "categories": {
      "Threats_Intimidation": true,
      "Voice_Clone_Synthetic": true,
      "Hate_Speech": false
    },
    "probs": {
      "Threats_Intimidation": 0.88,
      "Voice_Clone_Synthetic": 0.74,
      "Hate_Speech": 0.06
    }
  },
  "segments": [
    {
      "start_seconds": 4.0,
      "end_seconds": 14.0,
      "transcript": "...",
      "categories": { "Threats_Intimidation": true },
      "probs": { "Threats_Intimidation": 0.93 }
    }
  ],
  "latency_ms": 412
}

Field | Description
--- | ---
flag | true if any segment flagged the content.
summary.categories / summary.probs | Worst-case aggregated decision and probabilities across the clip.
segments[] | Per-chunk decisions with optional transcripts.
language | Detected language (or the one you passed in).
latency_ms | End-to-end classification latency.
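
To make the response fields concrete, here is a small sketch of how a client might read them. It assumes the response has exactly the shape shown in the example above; the helper names flagged_categories and flagged_segments and the min_prob parameter are illustrative and not part of the API.

```python
from typing import Any, Dict, List, Tuple


def flagged_categories(
    response: Dict[str, Any], min_prob: float = 0.0
) -> List[Tuple[str, float]]:
    """Clip-level categories marked true, highest probability first."""
    summary = response.get("summary", {})
    categories = summary.get("categories", {})
    probs = summary.get("probs", {})
    hits = [
        (name, probs.get(name, 0.0))
        for name, is_flagged in categories.items()
        if is_flagged and probs.get(name, 0.0) >= min_prob
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


def flagged_segments(response: Dict[str, Any]) -> List[Tuple[float, float, List[str]]]:
    """(start, end, categories) for every segment with at least one true category."""
    result = []
    for segment in response.get("segments", []):
        names = [n for n, on in segment.get("categories", {}).items() if on]
        if names:
            result.append((segment["start_seconds"], segment["end_seconds"], names))
    return result


# The example response from this guide, trimmed to the fields used here.
SAMPLE = {
    "flag": True,
    "summary": {
        "categories": {
            "Threats_Intimidation": True,
            "Voice_Clone_Synthetic": True,
            "Hate_Speech": False,
        },
        "probs": {
            "Threats_Intimidation": 0.88,
            "Voice_Clone_Synthetic": 0.74,
            "Hate_Speech": 0.06,
        },
    },
    "segments": [
        {
            "start_seconds": 4.0,
            "end_seconds": 14.0,
            "categories": {"Threats_Intimidation": True},
            "probs": {"Threats_Intimidation": 0.93},
        }
    ],
}
```

A stricter client-side cut-off such as min_prob=0.8 would keep only Threats_Intimidation for this sample; whether to apply one on top of the service's own decision is a deployment choice.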

Streaming Input

VirtueGuard-Audio also accepts live audio streams and returns moderation events incrementally — useful for call centers, voice agents, live moderation, and SIP/Twilio media-stream pipelines.

Supported transports

Transport | Use Case
--- | ---
WebSocket (PCM) | Push 16-bit PCM (or μ-law) frames over a WebSocket. Best fit for browser and custom clients.
Twilio Media Streams | Bidirectional WebSocket payloads in Twilio's media envelope; drop-in for <Connect><Stream>.
SIP / RTP | Connect to a SIPREC fork or an RTP feed for telephony pipelines.
gRPC bidi stream | Server-to-server low-latency integration.

Endpoint

POST /api/audiomoderation/streams to create a session, then push audio to the returned WebSocket URL (or have your telephony platform connect to it).

curl -X POST "https://api.virtueai.io/api/audiomoderation/streams" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "transport": "websocket-pcm",
    "sample_rate": 16000,
    "encoding": "pcm16",
    "language": "en",
    "chunk_seconds": 5,
    "callback_url": "https://your-app.example.com/policyguard-events"
  }'

Response:

{
  "stream_id": "as_01HX...",
  "status": "running",
  "websocket_url": "wss://api.virtueai.io/api/audiomoderation/streams/as_01HX.../audio"
}
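
The same session-creation call can be made from Python. This is a sketch using only the standard library; the function names build_session_request and create_session are hypothetical, and treating callback_url as optional is an assumption, since the guide does not say whether it is required.

```python
import json
import urllib.request
from typing import Optional

STREAMS_URL = "https://api.virtueai.io/api/audiomoderation/streams"


def build_session_request(
    api_key: str,
    transport: str = "websocket-pcm",
    sample_rate: int = 16000,
    encoding: str = "pcm16",
    language: str = "en",
    chunk_seconds: int = 5,
    callback_url: Optional[str] = None,  # assumed optional; not confirmed by the guide
) -> urllib.request.Request:
    """Build (but do not send) the POST that creates a streaming session."""
    body = {
        "transport": transport,
        "sample_rate": sample_rate,
        "encoding": encoding,
        "language": language,
        "chunk_seconds": chunk_seconds,
    }
    if callback_url:
        body["callback_url"] = callback_url
    return urllib.request.Request(
        STREAMS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_session(api_key: str, **options) -> dict:
    """Send the request and return the parsed JSON (stream_id, status, websocket_url)."""
    request = build_session_request(api_key, **options)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network call, so it is left commented out):
# session = create_session("your_api_key_here", chunk_seconds=5)
# print(session["websocket_url"])
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without network access.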

Minimal browser client

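// websocket_url is the value returned when the session was created.
const ws = new WebSocket(websocket_url);
ws.binaryType = "arraybuffer";

// mediaRecorder is assumed to be a MediaRecorder built from a getUserMedia() stream.
// Note: MediaRecorder emits encoded chunks (for example WebM/Opus), not raw PCM;
// for a pcm16 session, capture samples with the Web Audio API instead.
mediaRecorder.ondataavailable = (e) => {
  if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  // Segment events carry start_seconds / end_seconds (see "Event stream").
  if (result.type === "segment" && result.flag) {
    const hits = Object.keys(result.categories).filter((k) => result.categories[k]);
    console.warn(`Blocked at ${result.start_seconds}s: ${hits.join(", ")}`);
    mediaRecorder.stop();
  }
};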
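
A non-browser client that pushes raw audio over the websocket-pcm transport has to slice its PCM buffer into frames before sending. The sketch below shows that slicing step only. The 20 ms frame duration and the helper name pcm16_frames are assumptions: the guide does not state a required frame size, and sending each frame as a binary WebSocket message is left to whichever WebSocket client library you use.

```python
from typing import Iterator


def pcm16_frames(
    pcm: bytes,
    sample_rate: int = 16000,  # matches the sample_rate used when creating the session
    frame_ms: int = 20,        # assumed frame duration; not specified by the guide
    channels: int = 1,
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames of 16-bit PCM audio.

    The final frame may be shorter if the buffer is not a whole number of frames.
    """
    samples_per_frame = sample_rate * frame_ms // 1000
    bytes_per_frame = samples_per_frame * 2 * channels  # 2 bytes per 16-bit sample
    if bytes_per_frame <= 0:
        raise ValueError("sample_rate and frame_ms must yield at least one sample")
    for offset in range(0, len(pcm), bytes_per_frame):
        yield pcm[offset:offset + bytes_per_frame]


# One second of 16 kHz mono silence splits into 50 frames of 640 bytes each.
one_second = b"\x00\x00" * 16000
frames = list(pcm16_frames(one_second))
```

Each yielded frame would then be sent as one binary message on the session's websocket_url.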

Event stream

Each moderation decision is emitted as a JSON event on the same WebSocket (or POSTed to your callback_url):

{
  "stream_id": "as_01HX...",
  "type": "segment",
  "start_seconds": 4.0,
  "end_seconds": 9.0,
  "transcript": "...",
  "flag": true,
  "categories": { "Threats_Intimidation": true },
  "probs": { "Threats_Intimidation": 0.93 },
  "latency_ms": 84
}

type is one of segment, summary, or error. Segment events fire at the end of each chunk window; a summary event is emitted when you close the stream.
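
The three event types can be handled with a small dispatcher. The following is a sketch, assuming segment events have the shape shown above; the payloads of summary and error events are not documented in this guide, so the code stores them whole rather than reading specific fields. The StreamMonitor class is illustrative, not part of any SDK.

```python
import json
from typing import Any, Dict, List, Optional


class StreamMonitor:
    """Collects moderation events received over the WebSocket or a callback."""

    def __init__(self) -> None:
        self.flagged: List[Dict[str, Any]] = []
        self.summary: Optional[Dict[str, Any]] = None
        self.errors: List[Dict[str, Any]] = []

    def handle(self, raw: str) -> str:
        """Process one JSON event and return a short label for what happened."""
        event = json.loads(raw)
        kind = event.get("type")

        if kind == "segment":
            if event.get("flag"):
                self.flagged.append(
                    {
                        "start_seconds": event.get("start_seconds"),
                        "end_seconds": event.get("end_seconds"),
                        "categories": [
                            name
                            for name, on in event.get("categories", {}).items()
                            if on
                        ],
                    }
                )
                return "flagged"
            return "ok"

        if kind == "summary":
            # Emitted when the stream is closed; field layout is not documented here.
            self.summary = event
            return "summary"

        if kind == "error":
            self.errors.append(event)
            return "error"

        return "ignored"  # unknown type: keep running rather than crash
```

The label returned by handle can drive the caller's reaction, for example routing the call to a human reviewer on "flagged".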

Stop a session with DELETE /api/audiomoderation/streams/{stream_id}.


Common Integration Patterns

  • Call center / voice agent — stream live agent and caller audio over WebSocket or Twilio Media Streams; receive segment events in real time and route flagged calls to a human reviewer.
  • Voice assistant — check synthesized TTS output before playback; block responses that violate policy.
  • Voice clone / fraud detection — combine Voice_Clone_Synthetic signal with your KYC or anti-fraud pipeline.
  • Podcast / UGC moderation — scan uploads or generated audio before publishing.
  • Multimodal combo — pair with VideoGuard for full audio + video moderation, and PolicyGuard to express the block/allow rules declaratively.
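
As an illustration of the voice assistant pattern above, a playback gate could look like the following sketch. It assumes the batch response shape documented earlier in this guide; the function name playback_allowed and the idea of a per-deployment blocking list are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def playback_allowed(
    result: Dict[str, Any], blocking: Optional[Iterable[str]] = None
) -> bool:
    """Decide whether synthesized audio may be played to the user.

    With no blocking list, any flagged result stops playback. With a list,
    only the named categories stop playback and other flags are tolerated.
    """
    if blocking is None:
        return not result.get("flag", False)
    categories = result.get("summary", {}).get("categories", {})
    return not any(categories.get(name, False) for name in blocking)


# Example: a result flagged for threats and synthetic voice, but not hate speech.
example_result = {
    "flag": True,
    "summary": {
        "categories": {
            "Threats_Intimidation": True,
            "Voice_Clone_Synthetic": True,
            "Hate_Speech": False,
        }
    },
}
```

For TTS output, Voice_Clone_Synthetic is expected to fire, so a deployment would probably pass an explicit blocking list that leaves it out.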