Overview
SonioxSTTService is a speech-to-text (STT) service that integrates with Soniox's WebSocket API to provide real-time transcription. It processes audio input and produces transcription frames and interim transcription frames in real time, supporting over 60 languages. It also supports custom context, multiple languages within the same conversation, and more.
- API Reference: Complete API documentation
- Soniox Docs: Official Soniox documentation
- Example Code: Working example with interruption handling
Installation
To use SonioxSTTService, you need to install the Soniox dependencies:
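A typical install command, assuming the soniox extra is published for the pipecat-ai package:

```bash
pip install "pipecat-ai[soniox]"
```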
You'll also need to set your Soniox API key as an environment variable: SONIOX_API_KEY.
You can obtain a Soniox API key by signing up at the Soniox console.
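A minimal construction sketch; the module path pipecat.services.soniox.stt is an assumption here, so check it against your Pipecat version:

```python
import os

# Assumed module path for the service.
from pipecat.services.soniox.stt import SonioxSTTService

# Reads the API key from the SONIOX_API_KEY environment variable.
stt = SonioxSTTService(api_key=os.getenv("SONIOX_API_KEY"))
```

The service is then typically placed between the transport input and the LLM stages of a Pipecat pipeline.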
Frames
Input
By default, the service processes raw audio data with the following requirements:
- PCM audio format
- 16-bit depth
- 16kHz sample rate
- Single channel
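To make those requirements concrete, here is a sketch of a raw audio frame that satisfies them, using Pipecat's InputAudioRawFrame:

```python
from pipecat.frames.frames import InputAudioRawFrame

# 20 ms of silence at the default format: 16 kHz, 16-bit PCM, mono.
# 16000 samples/s * 0.02 s = 320 samples, 2 bytes each.
frame = InputAudioRawFrame(
    audio=b"\x00\x00" * 320,
    sample_rate=16000,
    num_channels=1,
)
```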
Output
The service produces the following frames during transcription:
- TranscriptionFrame - Final transcription results
- InterimTranscriptionFrame - Real-time transcription updates
- ErrorFrame - Connection or processing errors
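As a sketch of how a downstream stage might consume these frames, the TranscriptLogger below is a hypothetical processor, not part of Pipecat:

```python
from pipecat.frames.frames import Frame, InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Hypothetical processor placed downstream of SonioxSTTService."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Check interim first: interim updates arrive while the user speaks,
        # final results arrive once the service decides the utterance ended.
        if isinstance(frame, InterimTranscriptionFrame):
            print(f"[interim] {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"[final] {frame.text}")
        await self.push_frame(frame, direction)
```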
Advanced Features
Language Hints
There is no need to pre-select a language: the model automatically detects and transcribes any supported language. It also handles multilingual audio seamlessly, even when multiple languages are mixed within a single sentence or conversation. However, when you have prior knowledge of the languages likely to be spoken in your audio, you can use language hints to guide the model toward those languages for even greater recognition accuracy.

Note that Language.EN_GB will be treated the same as Language.EN. See Supported Languages for a list of supported languages.
You can learn more about language hints in the Soniox documentation.
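For instance, hinting English and Spanish might look like the sketch below; SonioxInputParams and its language_hints field are assumed names, so verify them against the API reference:

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService
from pipecat.transcriptions.language import Language

# Hint that English and Spanish are expected. Hints only bias the
# model; other supported languages are still recognized.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        language_hints=[Language.EN, Language.ES],
    ),
)
```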
Customization with Context
By providing context, you help the AI model better understand and anticipate the language in your audio, even if some terms do not appear clearly or completely.
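As a sketch, assuming SonioxInputParams exposes a context field mirroring Soniox's context feature:

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService

# Domain terms the model should be primed to recognize, e.g. drug names.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        context="Celebrex, Zyrtec, Xanax, Prilosec, Amoxicillin",
    ),
)
```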
Endpoint Detection and VAD
The SonioxSTTService processes your speech and has two ways of knowing when to finalize the text.
Automatic Pause Detection
By default, the service listens for natural pauses in your speech. When it detects that you've likely finished a sentence, it finalizes the transcription. You can learn more about endpoint detection in the Soniox documentation.

Using Voice Activity Detection (VAD)
For more explicit control, you can use a dedicated Voice Activity Detection (VAD) component within your Pipecat pipeline. The VAD's job is to detect when a user has completely stopped talking. To enable this behavior, set vad_force_turn_endpoint to True. This disables the automatic endpoint detection and forces the service to return transcription results as soon as the user stops talking.
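A sketch of this setup, assuming a Daily transport and the Silero VAD analyzer that ships with Pipecat:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.services.soniox.stt import SonioxSTTService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# The transport-level VAD decides when the user has stopped talking.
transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder room URL
    token=None,
    bot_name="bot",
    params=DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

# With vad_force_turn_endpoint=True, the service skips automatic endpoint
# detection and finalizes as soon as the VAD reports the end of speech.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    vad_force_turn_endpoint=True,
)
```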