Overview
SonioxSTTService is a speech-to-text (STT) service that integrates with Soniox's WebSocket API to provide real-time transcription. It processes audio input and produces transcription frames and interim transcription frames in real time, supporting over 60 languages. It also supports custom context, multiple languages within the same conversation, and more.
- API Reference: Complete API documentation
- Soniox Docs: Official Soniox documentation
- Example Code: Working example with interruption handling
Installation
To use SonioxSTTService, you need to install the Soniox dependencies:
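A typical install command, assuming the soniox extra is published for the pipecat-ai package:

```bash
pip install "pipecat-ai[soniox]"
```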
You'll also need to set your Soniox API key as an environment variable: SONIOX_API_KEY.
You can obtain a Soniox API key by signing up at the Soniox console.
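A minimal construction sketch; the module path pipecat.services.soniox.stt is an assumption here, so check it against your Pipecat version:

```python
import os

# Assumed module path for the service.
from pipecat.services.soniox.stt import SonioxSTTService

# Reads the API key from the SONIOX_API_KEY environment variable.
stt = SonioxSTTService(api_key=os.getenv("SONIOX_API_KEY"))
```

The service is then typically placed between the transport input and the LLM stages of a Pipecat pipeline.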
Frames
Input
By default, the service processes raw audio data with the following requirements:
- PCM audio format
- 16-bit depth
- 16kHz sample rate
- Single channel
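To make those requirements concrete, here is a sketch of a raw audio frame that satisfies them, using Pipecat's InputAudioRawFrame:

```python
from pipecat.frames.frames import InputAudioRawFrame

# 20 ms of silence at the default format: 16 kHz, 16-bit PCM, mono.
# 16000 samples/s * 0.02 s = 320 samples, 2 bytes each.
frame = InputAudioRawFrame(
    audio=b"\x00\x00" * 320,
    sample_rate=16000,
    num_channels=1,
)
```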
Output
The service produces the following frames during transcription:
- TranscriptionFrame - Final transcription results
- InterimTranscriptionFrame - Real-time transcription updates
- ErrorFrame - Connection or processing errors
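As a sketch of how a downstream stage might consume these frames, the TranscriptLogger below is a hypothetical processor, not part of Pipecat:

```python
from pipecat.frames.frames import Frame, InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Hypothetical processor placed downstream of SonioxSTTService."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Check interim first: interim updates arrive while the user speaks,
        # final results arrive once the service decides the utterance ended.
        if isinstance(frame, InterimTranscriptionFrame):
            print(f"[interim] {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"[final] {frame.text}")
        await self.push_frame(frame, direction)
```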
Advanced Features
Language Hints
There is no need to pre-select a language: the model automatically detects and transcribes any supported language. It also handles multilingual audio seamlessly, even when multiple languages are mixed within a single sentence or conversation. However, when you have prior knowledge of the languages likely to be spoken in your audio, you can use language hints to guide the model toward those languages for even greater recognition accuracy.

Note that Language.EN_GB will be treated the same as Language.EN. See Supported Languages for a list of supported languages.
You can learn more about language hints in the Soniox documentation.
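For instance, hinting English and Spanish might look like the sketch below; SonioxInputParams and its language_hints field are assumed names, so verify them against the API reference:

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService
from pipecat.transcriptions.language import Language

# Hint that English and Spanish are expected. Hints only bias the
# model; other supported languages are still recognized.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        language_hints=[Language.EN, Language.ES],
    ),
)
```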
Customization with Context
By providing context, you help the AI model better understand and anticipate the language in your audio, even if some terms do not appear clearly or completely.
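As a sketch, assuming SonioxInputParams exposes a context field mirroring Soniox's context feature:

```python
import os

from pipecat.services.soniox.stt import SonioxInputParams, SonioxSTTService

# Domain terms the model should be primed to recognize, e.g. drug names.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    params=SonioxInputParams(
        context="Celebrex, Zyrtec, Xanax, Prilosec, Amoxicillin",
    ),
)
```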
Endpoint Detection and VAD
The SonioxSTTService processes your speech and has two ways of knowing when to finalize the text.
Automatic Pause Detection
By default, the service listens for natural pauses in your speech. When it detects that you've likely finished a sentence, it finalizes the transcription. You can learn more about endpoint detection in the Soniox documentation.

Using Voice Activity Detection (VAD)
For more explicit control, you can use a dedicated Voice Activity Detection (VAD) component within your Pipecat pipeline. The VAD's job is to detect when a user has completely stopped talking. To enable this behavior, set vad_force_turn_endpoint to True. This disables the automatic endpoint detection and forces the service to return transcription results as soon as the user stops talking.
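A sketch of this setup, assuming a Daily transport and the Silero VAD analyzer that ships with Pipecat:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.services.soniox.stt import SonioxSTTService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# The transport-level VAD decides when the user has stopped talking.
transport = DailyTransport(
    room_url="https://example.daily.co/room",  # placeholder room URL
    token=None,
    bot_name="bot",
    params=DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
)

# With vad_force_turn_endpoint=True, the service skips automatic endpoint
# detection and finalizes as soon as the VAD reports the end of speech.
stt = SonioxSTTService(
    api_key=os.getenv("SONIOX_API_KEY"),
    vad_force_turn_endpoint=True,
)
```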