Pipeline Placement
TTS processors must be positioned correctly in your pipeline to receive text and generate audio frames (see the ordering sketch after this list):
- After LLM processing: TTS needs `LLMTextFrame`s from language model responses
- Before transport output: Audio must be generated before sending to the user
- Before the assistant context aggregator: Ensures spoken text is captured in conversation history
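A minimal sketch of this ordering, assuming `transport`, `stt`, `llm`, `tts`, and `context_aggregator` have already been constructed with whichever providers you've configured:

```python
from pipecat.pipeline.pipeline import Pipeline

# Assumed to be created earlier: transport, stt, llm, tts, context_aggregator
pipeline = Pipeline([
    transport.input(),              # Audio in from the user
    stt,                            # Speech-to-text produces transcription frames
    context_aggregator.user(),      # Add user text to the conversation context
    llm,                            # LLM streams LLMTextFrames
    tts,                            # TTS after the LLM: text becomes audio frames
    transport.output(),             # TTS before transport output: audio plays to the user
    context_aggregator.assistant(), # After transport: capture what was actually spoken
])
```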
Frame Processing Flow
TTS generates speech through two primary mechanisms:
- Streamed LLM tokens via `LLMTextFrame`s:
  - TTS aggregates streaming tokens into complete sentences
  - Sentences are sent to the TTS service for audio generation
  - Audio bytes stream back and play immediately
  - End-to-end latency often under 200ms
- Direct speech requests via `TTSSpeakFrame`s:
  - Bypasses the LLM and context aggregators
  - Immediate audio generation for specific text
  - Useful for system messages or prompts
In both cases, the TTS service emits the following frames downstream:
- `TTSAudioRawFrame`s: Raw audio data for playback
- `TTSTextFrame`s: Text that was actually spoken (for context updates)
- `TTSStartedFrame`/`TTSStoppedFrame`: Speech boundary markers
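If you want to observe these frames yourself, you can drop a custom processor into the pipeline. Below is a minimal sketch (the `SpeechLogger` class is illustrative, not part of Pipecat) that logs speech boundaries and spoken text as frames pass through:

```python
from pipecat.frames.frames import Frame, TTSStartedFrame, TTSStoppedFrame, TTSTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SpeechLogger(FrameProcessor):
    """Logs TTS speech boundaries and spoken text as frames pass through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, TTSStartedFrame):
            print("Bot started speaking")
        elif isinstance(frame, TTSStoppedFrame):
            print("Bot stopped speaking")
        elif isinstance(frame, TTSTextFrame):
            print(f"Spoken text: {frame.text}")

        # Always pass frames along so the rest of the pipeline keeps working
        await self.push_frame(frame, direction)
```

Place a processor like this between `tts` and `transport.output()` to see frames before the audio is played.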
Supported TTS Services
Pipecat supports a wide range of TTS providers with different capabilities and performance characteristics. See Supported TTS Services for the complete list of supported text-to-speech providers.
Service Categories
WebSocket-Based Services (Recommended):
- Cartesia: Ultra-low latency with word timestamps
- ElevenLabs: High-quality voices with emotion control
- Rime: Ultra-realistic voices with advanced features

HTTP-Based Services:
- OpenAI TTS: High-quality synthesis with multiple voices
- Azure Speech: Enterprise-grade with extensive language support
- Google Text-to-Speech: Reliable with WaveNet voices

Specialized features that vary by provider:
- Word timestamps: Enable word-level accuracy for context and subtitles
- Voice cloning: Custom voice creation from samples
- Emotion control: Dynamic emotional expression
- SSML support: Fine-grained pronunciation control
WebSocket services typically provide the lowest latency, while HTTP services may see intermittently higher latency due to their request/response nature.
TTS Configuration
Service-Specific Configuration
Each TTS service has its own configuration options. Here's an example with Cartesia (see the sketch below). For the configuration options of each supported TTS provider, see the Individual TTS Services reference.
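A minimal configuration sketch, assuming a Cartesia API key in the environment; the voice ID and model name here are placeholders, and the import path may differ slightly between Pipecat versions:

```python
import os

from pipecat.services.cartesia.tts import CartesiaTTSService

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-cartesia-voice-id",  # Placeholder: use a voice ID from your Cartesia account
    model="sonic-2",                    # Placeholder model name; pick the model you want to use
)
```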
Pipeline-Level Audio Configuration
Set consistent audio settings across your entire pipeline via `PipelineParams` (see the sketch below).
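A minimal sketch, assuming the `pipeline` from earlier and a TTS service that outputs 24 kHz audio (adjust the rate to your provider):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_out_sample_rate=24000,  # Match your TTS service's output sample rate
    ),
)
```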
Set the `audio_out_sample_rate` to match your TTS service's requirements for optimal quality. This is preferred to setting the `sample_rate` directly in the TTS service, as the `PipelineParams` setting ensures that all output sample rates match.
Text Processing and Filtering
Custom Text Aggregation
Control how streaming text is processed before synthesis, such as changing how tokens are aggregated into sentences before they are sent to the TTS service.
Text Filters
Apply preprocessing to text before synthesis (see the sketch after this list):
- MarkdownTextFilter: Strips markdown formatting from LLM responses
- Custom filters: Implement your own text preprocessing logic
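A minimal sketch of attaching the Markdown filter to a TTS service, continuing the Cartesia example from above; the exact parameter name (`text_filters`) and import path may vary by Pipecat version, so verify against your installed release:

```python
from pipecat.utils.text.markdown_text_filter import MarkdownTextFilter

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-cartesia-voice-id",    # Placeholder voice ID
    text_filters=[MarkdownTextFilter()],  # Strip markdown before the text is spoken
)
```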
Advanced TTS Features
Direct Speech Commands
Use `TTSSpeakFrame` for immediate speech synthesis:
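A minimal sketch, run inside an async function and assuming `task` is the `PipelineTask` created earlier:

```python
from pipecat.frames.frames import TTSSpeakFrame

# Speak a fixed message immediately, bypassing the LLM and context aggregators
await task.queue_frame(TTSSpeakFrame("Hi there! How can I help you today?"))
```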
Dynamic Settings Updates
Update TTS settings during the conversation:
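A minimal sketch using `TTSUpdateSettingsFrame`; the settings keys a service accepts (here, `voice`) depend on the TTS provider, so treat the key name and value as assumptions to check against your provider's options:

```python
from pipecat.frames.frames import TTSUpdateSettingsFrame

# Switch to a different voice mid-conversation (supported keys vary by provider)
await task.queue_frame(TTSUpdateSettingsFrame(settings={"voice": "another-voice-id"}))
```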
Key Takeaways
- Pipeline placement matters - TTS must come after the LLM, before transport output
- Service types differ - WebSocket services provide lower latency than HTTP
- Text processing affects quality - use aggregation and filters for better results
- Word timestamps enable precision - better interruption handling and context accuracy
- Configuration impacts performance - balance quality, latency, and bandwidth needs
- Services are modular - easily swap providers without changing pipeline code
What’s Next
You’ve now learned how to build a complete voice AI pipeline! Let’s explore some additional topics to enhance your implementation.
Next up: Pipeline Termination, where you’ll learn how to terminate your voice AI pipeline at the end of a conversation.