Overview
`UltravoxSTTService` provides real-time speech-to-text using the Ultravox multimodal model running locally. Ultravox directly encodes audio into the LLM’s embedding space, eliminating traditional ASR components and providing faster, more efficient transcription with built-in conversational understanding.
- API Reference - Complete API documentation and method details
- Ultravox Docs - Official Ultravox documentation and features
- Example Code - Working example with GPU optimization
Installation
To use Ultravox services, install the required dependency and expose your Hugging Face token as the `HF_TOKEN` environment variable. Get your Hugging Face token from Hugging Face Settings.
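A minimal setup sketch; the `ultravox` package extra shown here is an assumption and may differ across Pipecat versions:

```bash
# Install Pipecat with Ultravox support (extra name assumed)
pip install "pipecat-ai[ultravox]"

# Expose your Hugging Face token for model downloads
export HF_TOKEN=your_token_here
```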
Frames
Input
- `InputAudioRawFrame` - Raw PCM audio data (16-bit, 16kHz, mono)
- `UserStartedSpeakingFrame` - Triggers audio buffering
- `UserStoppedSpeakingFrame` - Processes collected audio
- `STTUpdateSettingsFrame` - Runtime transcription configuration updates
- `STTMuteFrame` - Mute audio input for transcription
Output
- `LLMFullResponseStartFrame` - Indicates transcription generation start
- `LLMTextFrame` - Streaming text tokens as they’re generated (see the aggregation sketch below)
- `LLMFullResponseEndFrame` - Indicates transcription completion
- `ErrorFrame` - Processing errors or resource issues
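Because transcriptions stream out as LLM frames rather than a single `TranscriptionFrame`, a downstream processor that needs the full utterance can aggregate between the start and end frames. A minimal sketch, assuming Pipecat's standard `FrameProcessor` API; the `TranscriptCollector` class is hypothetical, not part of the library:

```python
from pipecat.frames.frames import (
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    LLMTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptCollector(FrameProcessor):
    """Hypothetical processor that joins streamed LLMTextFrame tokens
    into one string per transcription."""

    def __init__(self):
        super().__init__()
        self._parts: list[str] = []

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, LLMFullResponseStartFrame):
            self._parts = []  # a new transcription is starting
        elif isinstance(frame, LLMTextFrame):
            self._parts.append(frame.text)  # accumulate streamed tokens
        elif isinstance(frame, LLMFullResponseEndFrame):
            print("Transcription:", "".join(self._parts))

        await self.push_frame(frame, direction)  # always pass frames along
```

Placed in the pipeline directly after the STT service, this prints each completed transcription while leaving the frame stream untouched.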
Models
Ultravox offers several models with different resource requirements:

- `fixie-ai/ultravox-v0_6-llama-3_3-70b` - Latest model with improved accuracy and efficiency
- `fixie-ai/ultravox-v0_5-llama-3_3-70b` - Recommended for new deployments
- `fixie-ai/ultravox-v0_5-llama-3_1-8b` - Smaller model for resource-constrained environments
- `fixie-ai/ultravox-v0_4_1-llama-3_1-8b` - Previous version for compatibility
- `fixie-ai/ultravox-v0_4_1-llama-3_1-70b` - Larger model for high accuracy
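Since the 70B checkpoints need far more GPU memory than the 8B ones, a deployment can choose a model at startup based on available VRAM. A rough sketch; `pick_ultravox_model` and the 140 GB threshold are assumptions for illustration, not part of the service:

```python
import torch


def pick_ultravox_model() -> str:
    """Hypothetical helper: prefer a 70B checkpoint only when the GPU
    plausibly has room for it; otherwise fall back to the 8B model."""
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb >= 140:  # rough assumption for an unquantized 70B model
            return "fixie-ai/ultravox-v0_6-llama-3_3-70b"
    return "fixie-ai/ultravox-v0_5-llama-3_1-8b"
```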
Usage Example
Basic Configuration
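A minimal sketch of wiring the service into a pipeline. The `pipecat.services.ultravox.stt` import path and the `model_name`/`hf_token` parameters are assumptions that may differ across Pipecat versions; the transport is created elsewhere and passed in:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.ultravox.stt import UltravoxSTTService  # path assumed


def build_pipeline(transport) -> Pipeline:
    # Ultravox buffers audio between UserStartedSpeakingFrame and
    # UserStoppedSpeakingFrame, so the transport must be configured with
    # a VAD analyzer (e.g. SileroVADAnalyzer) to emit those frames.
    stt = UltravoxSTTService(
        model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
        hf_token=os.getenv("HF_TOKEN"),  # needed to download the model
    )

    return Pipeline([
        transport.input(),  # audio + speaking frames from the transport
        stt,                # emits LLMTextFrame tokens as they stream
        # ... downstream processors (aggregation, LLM, TTS, output)
    ])
```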
Metrics
The service provides comprehensive metrics:

- Time to First Byte (TTFB) - Latency from speech end to first token
- Processing Duration - Total time for audio processing and generation
Learn how to enable Metrics in your Pipeline.
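Continuing the sketch above, metrics are switched on at the task level through `PipelineParams`; flag availability can vary by Pipecat version:

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Enables TTFB and processing-duration metrics for every processor in
# the pipeline, including UltravoxSTTService.
task = PipelineTask(
    pipeline,  # e.g. the Pipeline returned by build_pipeline() above
    params=PipelineParams(enable_metrics=True),
)
```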
Additional Notes
- VAD Dependency: Requires Voice Activity Detection (VAD) to trigger audio processing
- GPU Acceleration: Designed for GPU deployment; consider using Cerebrium, Modal, or other GPU-optimized environments
- Model Loading: First model load can take several minutes; consider pre-initialization (see the warm-up sketch after this list)
- Memory Usage: Audio buffer grows with speech duration; automatically cleared after processing
- Output Format: Generates `LLMTextFrame` objects, not traditional `TranscriptionFrame` objects
- Local Processing: All processing happens locally; no external API calls after model download
- Hugging Face Authentication: Required for downloading models from Hugging Face Hub
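On the model-loading note above: one way to hide the multi-minute first load is to construct the service during the deployment's warm-up phase instead of when a user connects, which fits one-session-per-process platforms like Cerebrium or Modal. A sketch; whether construction alone loads the model is an assumption, so verify against your Pipecat version:

```python
import os

from pipecat.services.ultravox.stt import UltravoxSTTService  # path assumed

# Constructing the service at startup front-loads the Hugging Face
# download and model load so the first caller doesn't wait. If your
# version lazy-loads instead, run a short warm-up utterance here.
stt = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)
```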