Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates
or support.
Overview
XTTS (Cross-lingual Text-to-Speech) provides multilingual voice synthesis with voice cloning capabilities through a locally hosted streaming server. The service supports real-time streaming and custom voice training using Coqui’s XTTS-v2 model.
- API Reference: Complete API documentation and method details
- XTTS Repository: Official XTTS streaming server repository
- Example Code: Working example with voice cloning
Installation
XTTS requires a running streaming server. Start the server using Docker.
GPU acceleration is recommended for optimal performance. The server requires CUDA support.
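Once the server is running, it can be useful to confirm that it is reachable before wiring it into a pipeline. The following is a minimal sketch using only the standard library. It assumes the streaming server exposes a `/studio_speakers` endpoint that returns JSON keyed by speaker name; the helper names `studio_speakers_url` and `list_speakers` are hypothetical and not part of any library.

```python
import json
import urllib.request
from typing import List, Optional


def studio_speakers_url(base_url: str) -> str:
    """Build the URL of the (assumed) speaker-listing endpoint."""
    return base_url.rstrip("/") + "/studio_speakers"


def list_speakers(base_url: str, timeout: float = 5.0) -> Optional[List[str]]:
    """Return the speaker names reported by the server, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(studio_speakers_url(base_url), timeout=timeout) as resp:
            # Assumption: the response body is JSON whose top-level keys
            # (or list items) are speaker names.
            return sorted(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts;
        # ValueError covers a body that is not valid JSON.
        return None
```

For example, `list_speakers("http://localhost:8000")` would return a list of names if a server is listening at that address (the address itself is an assumption; use whatever host and port you mapped when starting the container) and `None` otherwise.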
Frames
Input
- TextFrame - Text content to synthesize into speech
- TTSSpeakFrame - Text that should be spoken immediately
- TTSUpdateSettingsFrame - Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries
Output
- TTSStartedFrame - Signals start of synthesis
- TTSAudioRawFrame - Generated audio data (streaming, resampled from 24 kHz)
- TTSStoppedFrame - Signals completion of synthesis
- ErrorFrame - Server connection or processing errors
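Because the audio is generated at 24 kHz and then resampled, the size and duration of each audio chunk depend on the output sample rate. The arithmetic can be sketched as follows. This assumes 16-bit mono PCM, which the text above does not state, and both helper functions are hypothetical illustrations rather than part of the service.

```python
XTTS_NATIVE_RATE = 24_000  # Hz, the native rate mentioned for TTSAudioRawFrame
BYTES_PER_SAMPLE = 2       # assumption: 16-bit mono PCM


def resampled_sample_count(num_samples: int, target_rate: int,
                           source_rate: int = XTTS_NATIVE_RATE) -> int:
    """Approximate number of samples a chunk contains after resampling."""
    return round(num_samples * target_rate / source_rate)


def chunk_duration_ms(num_bytes: int, sample_rate: int) -> float:
    """Playback duration in milliseconds of a raw PCM chunk."""
    return num_bytes * 1000 / (BYTES_PER_SAMPLE * sample_rate)
```

Under these assumptions, a 4800-byte chunk at 24 kHz holds 2400 samples (100 ms of audio), and resampling it to 16 kHz leaves about 1600 samples covering the same 100 ms.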
Language Support
XTTS supports multiple languages with cross-lingual capabilities:
Language Code | Description | Service Code |
---|---|---|
Language.CS | Czech | cs |
Language.DE | German | de |
Language.EN | English | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.HI | Hindi | hi |
Language.HU | Hungarian | hu |
Language.IT | Italian | it |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.PT | Portuguese | pt |
Language.RU | Russian | ru |
Language.TR | Turkish | tr |
Language.ZH | Chinese (Simplified) | zh-cn |
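The table above amounts to a lookup from a language identifier to the code sent to the server. A minimal standalone sketch of that lookup is shown below. The function name `to_xtts_code` is hypothetical, and falling back from a regional tag such as `en-US` to its base language is an assumption about desirable behavior, not something the table specifies.

```python
from typing import Optional

# Service codes copied from the table above, keyed by base language.
XTTS_LANGUAGE_CODES = {
    "cs": "cs", "de": "de", "en": "en", "es": "es",
    "fr": "fr", "hi": "hi", "hu": "hu", "it": "it",
    "ja": "ja", "ko": "ko", "nl": "nl", "pl": "pl",
    "pt": "pt", "ru": "ru", "tr": "tr", "zh": "zh-cn",
}


def to_xtts_code(language_tag: str) -> Optional[str]:
    """Map a tag such as "en" or "en-US" to an XTTS service code, or None if unsupported."""
    # Assumption: regional variants reuse the base language's service code.
    base = language_tag.lower().replace("_", "-").split("-")[0]
    return XTTS_LANGUAGE_CODES.get(base)
```

Note that Chinese is the only entry whose service code (`zh-cn`) differs from its base language tag.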
Usage Example
Basic Configuration
Initialize the XTTSService and use it in a pipeline.
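A minimal sketch of this setup might look like the following. Several things here are assumptions rather than confirmed details: the import paths (which can differ between Pipecat versions), the constructor keyword arguments `voice_id`, `base_url`, `aiohttp_session`, and `language`, the server address, and the voice name. `XTTS_CONFIG` and `build_pipeline` are hypothetical names used only for illustration.

```python
# Assumed connection settings; adjust both values to match your server.
XTTS_CONFIG = {
    "base_url": "http://localhost:8000",  # assumption: where the Docker server listens
    "voice_id": "Claribel Dervla",        # assumption: a speaker known to the server
}


def build_pipeline(session):
    """Create an XTTSService bound to an aiohttp session and wrap it in a Pipeline."""
    # Imported lazily so this sketch can be loaded without Pipecat installed.
    # Assumption: these module paths match your installed Pipecat version.
    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.services.xtts.tts import XTTSService
    from pipecat.transcriptions.language import Language

    tts = XTTSService(
        aiohttp_session=session,
        language=Language.EN,
        **XTTS_CONFIG,
    )
    # A real pipeline would normally place transport input, STT, and LLM
    # processors before the TTS service and transport output after it.
    return Pipeline([tts])
```

The helper would typically be called inside `async with aiohttp.ClientSession() as session:` so that the HTTP session outlives the pipeline run.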
Dynamic Configuration
Update settings at runtime by pushing a TTSUpdateSettingsFrame to the XTTSService.
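As a rough sketch, a runtime voice change could be queued as shown below. This assumes that `TTSUpdateSettingsFrame` accepts a `settings` mapping, that the service reads the new voice from a `"voice"` key, and that `task` is a running pipeline task with an awaitable `queue_frame` method; check these against your Pipecat version. The helpers `voice_update_settings` and `switch_voice` are hypothetical.

```python
def voice_update_settings(voice_id: str) -> dict:
    """Build the settings payload for switching the voice."""
    # Assumption: the TTS service looks up the new voice under the "voice" key.
    return {"voice": voice_id}


async def switch_voice(task, voice_id: str) -> None:
    """Queue a settings update frame on a running pipeline task."""
    # Imported lazily so this sketch can be loaded without Pipecat installed.
    from pipecat.frames.frames import TTSUpdateSettingsFrame

    frame = TTSUpdateSettingsFrame(settings=voice_update_settings(voice_id))
    await task.queue_frame(frame)
```

Because the update travels through the pipeline as a frame, it takes effect in order with the other frames the task has queued.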
Metrics
The service reports the following metrics:
- Time to First Byte (TTFB) - Latency from text input to first audio
- Processing Duration - Total synthesis time
- Streaming Performance - Buffer utilization and chunk processing
Learn how to enable Metrics in your Pipeline.
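To make the first two definitions concrete, the sketch below times an arbitrary stream of audio chunks. It is an illustration of what TTFB and processing duration mean, not the service's own metrics implementation, and `measure_stream` is a hypothetical helper.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[Optional[float], float, int]:
    """Return (ttfb_seconds, total_seconds, total_bytes) for a stream of audio chunks."""
    start = time.perf_counter()
    ttfb: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            # First non-empty chunk: this is the time to first byte.
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    # Total time until the stream is exhausted (processing duration).
    total = time.perf_counter() - start
    return ttfb, total, total_bytes
```

`ttfb` is `None` when the stream produced no audio at all, which keeps a failed request distinguishable from a fast one.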
Additional Notes
- Local Deployment: Runs entirely on local infrastructure for privacy
- Voice Cloning: Supports custom voice training with audio samples
- Cross-lingual: Can synthesize multiple languages with the same voice