Pipeline Placement
STT processors must be positioned correctly in your pipeline to receive and process audio frames:
- After `transport.input()`: STT needs `InputAudioRawFrame`s from the transport
- Before context processing: Transcriptions must be available for context aggregation
- Before LLM processing: Text must be ready for language model input
STT Service Types
Pipecat provides two types of STT services based on how they process audio:

1. STTService (Streaming)
How it works:
- Establishes a WebSocket connection to the STT provider
- Continuously streams audio for real-time transcription
- Lower latency due to the persistent connection

2. SegmentedSTTService (HTTP-based)
How it works:
- Uses local VAD (Voice Activity Detection) to chunk speech
- Sends audio segments to the STT service as WAV files
- Higher latency due to segmentation and HTTP POST requests
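To make the segmented flow concrete, here is a toy sketch of VAD-style chunking in plain Python. The "VAD" below is just an energy threshold, not Pipecat's actual Silero VAD; it only illustrates the idea of closing a segment after a run of silence and shipping each segment off separately:

```python
# Toy sketch of how a segmented STT service chunks audio locally before
# posting each segment over HTTP. Real VAD (e.g. Silero) is a neural
# model; this uses a simple energy threshold for illustration.

def segment_speech(samples, threshold=0.1, min_silence=3):
    """Split a stream of audio 'energy' values into speech segments,
    closing a segment after `min_silence` consecutive quiet samples."""
    segments, current, quiet = [], [], 0
    for s in samples:
        if abs(s) >= threshold:
            current.append(s)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= min_silence:
                segments.append(current)  # segment would be POSTed as a WAV
                current, quiet = [], 0
            else:
                current.append(s)
    if current:
        segments.append(current)
    return segments

stream = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.4, 0.3, 0.0, 0.0, 0.0]
print(len(segment_speech(stream)))  # → 2
```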
STT services are modular and can be swapped out with no additional overhead.
You can easily switch between streaming and segmented services based on your
needs.
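For example, swapping providers usually means constructing a different service object while the rest of the pipeline stays unchanged. The constructor parameters and class names below follow Pipecat's conventions but should be verified against each service's docs:

```python
import os

# Streaming (WebSocket-based) STT — class and parameter names are
# assumptions to check against the Deepgram service documentation:
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

# Segmented (HTTP/local) alternative — same pipeline position,
# same frame types in and out:
# stt = WhisperSTTService()
```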
Supported STT Services
Pipecat supports a wide range of STT providers to fit different needs and budgets:

Supported STT Services
View the complete list of supported speech-to-text providers
Deepgram
Fast, accurate streaming STT with excellent real-time performance
Speechmatics
Advanced speech recognition with strong accent and dialect handling
AssemblyAI
AI-powered transcription with speaker diarization and sentiment analysis
Gladia
High-performance STT with multilingual support and custom models
Azure Speech
Microsoft’s enterprise-grade STT service with extensive language support
Google Speech-to-Text
Reliable transcription with strong language model integration
STT Configuration
Service-Specific Configuration
Each STT service has its own customization options. Refer to the specific service documentation for details:

Individual STT Services
Explore configuration options for each supported STT provider

For example, Deepgram's streaming behavior is configured through its `LiveOptions` class.
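As an illustrative sketch, Deepgram options can be passed to Pipecat's Deepgram service via the Deepgram SDK's `LiveOptions`. The import path and option names here are assumptions to verify against the service docs and your Pipecat version:

```python
import os

from deepgram import LiveOptions
# Import path may vary by Pipecat version:
from pipecat.services.deepgram import DeepgramSTTService

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        model="nova-2",
        language="en-US",
        smart_format=True,     # punctuation and formatting
        interim_results=True,  # emit partial transcripts
    ),
)
```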
STTService Base Class Configuration
All STT services inherit from the STTService base class, which provides base configuration options with smart defaults:
- `audio_passthrough=True`: Allows audio frames to continue downstream to other processors (like audio recording)
- `sample_rate`: Audio sampling rate; best practice is to set `audio_in_sample_rate` in `PipelineParams` for consistency
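A minimal sketch of setting the sample rate at the pipeline level, assuming Pipecat's `PipelineParams`/`PipelineTask` API (parameter names should be checked against your version):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is assumed to be a constructed Pipeline. The sample rates
# set here propagate to the input (STT) and output (TTS) processors.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # applied to STT and other input processors
        audio_out_sample_rate=24000,  # applied to TTS and audio output
    ),
)
```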
Setting `audio_passthrough=False` will stop audio frames from being passed downstream, which may break audio recording or other audio-dependent processors.

Pipeline-Level Audio Configuration
Instead of setting sample rates on individual services, configure them pipeline-wide. Always set audio sample rates in `PipelineParams` to avoid mismatches between different audio processors. This simplifies configuration and ensures consistent audio quality across your pipeline.

Best Practices
Enable Interim Results
When available, enable interim transcripts for a better user experience:
- Notifies context aggregation that more text is coming
- Prevents premature LLM completions
- Enables interruption detection
- Improves conversation flow
Enable Punctuation and Formatting
Use smart formatting when available:
- Professional-looking transcripts
- Better LLM comprehension
- Eliminates post-processing needs
- Improved context understanding
Use Local VAD
While many STT services provide Voice Activity Detection, use Pipecat’s local Silero VAD for better performance:
- 150-200ms faster speech detection (no network round trip)
- More responsive conversation flow
- Better interruption handling
- Reduced latency overall
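Local VAD is typically attached to the transport. A hedged sketch using Pipecat's `SileroVADAnalyzer` with a Daily transport (import paths and parameter names vary by Pipecat version, and `room_url`/`token` are placeholders you must supply):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# room_url and token are assumed to come from your Daily room setup.
transport = DailyTransport(
    room_url,
    token,
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # local VAD: no network round trip
    ),
)
```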
Key Takeaways
- Pipeline placement matters - STT must come after transport input, before context processing
- Service types differ - streaming services have lower latency than segmented
- Services are modular - easily swap providers without code changes
- Best practices improve performance - use interim results, formatting, and local VAD
- Configuration affects quality - proper setup significantly impacts transcription accuracy
What’s Next
Now that you understand speech recognition, let’s explore how to manage conversation context and memory in your voice AI bot.

Context Management
Learn how to handle conversation history and context in your pipeline