XTTS

Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTS (Cross-lingual Text-to-Speech) provides multilingual voice synthesis with voice cloning capabilities through a locally hosted streaming server. The service supports real-time streaming and custom voice training using Coqui’s XTTS-v2 model.

API Reference

Complete API documentation and method details

XTTS Repository

Official XTTS streaming server repository

Example Code

Working example with voice cloning

Installation

XTTS requires a running streaming server. Start the server using Docker:

docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121

GPU acceleration is recommended for optimal performance. The server requires CUDA support.

Frames

Input

TextFrame - Text content to synthesize into speech
TTSSpeakFrame - Text that should be spoken immediately
TTSUpdateSettingsFrame - Runtime configuration updates
LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

TTSStartedFrame - Signals start of synthesis
TTSAudioRawFrame - Generated audio data (streaming, resampled from 24kHz)
TTSStoppedFrame - Signals completion of synthesis
ErrorFrame - Server connection or processing errors

Language Support

XTTS supports multiple languages with cross-lingual capabilities:

Language Code	Description	Service Code
`Language.CS`	Czech	`cs`
`Language.DE`	German	`de`
`Language.EN`	English	`en`
`Language.ES`	Spanish	`es`
`Language.FR`	French	`fr`
`Language.HI`	Hindi	`hi`
`Language.HU`	Hungarian	`hu`
`Language.IT`	Italian	`it`
`Language.JA`	Japanese	`ja`
`Language.KO`	Korean	`ko`
`Language.NL`	Dutch	`nl`
`Language.PL`	Polish	`pl`
`Language.PT`	Portuguese	`pt`
`Language.RU`	Russian	`ru`
`Language.TR`	Turkish	`tr`
`Language.ZH`	Chinese (Simplified)	`zh-cn`

Usage Example

Basic Configuration

Initialize the XTTSService and use it in a pipeline:

from pipecat.services.xtts.tts import XTTSService
from pipecat.transcriptions.language import Language
import aiohttp

async def setup_tts():
    # Create HTTP session for server communication
    session = aiohttp.ClientSession()

    tts = XTTSService(
        aiohttp_session=session,
        voice_id="Claribel Dervla",
        base_url="http://localhost:8000",
        language=Language.EN
    )

    # Use in pipeline
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant()
    ])

    return pipeline, session

Dynamic Configuration

Make settings updates by pushing an TTSUpdateSettingsFrame for the XTTSService:

from pipecat.frames.frames import TTSUpdateSettingsFrame

await task.queue_frame(
    TTSUpdateSettingsFrame(settings={"voice": "your-new-voice-id"})
)

Metrics

The service provides comprehensive metrics:

Time to First Byte (TTFB) - Latency from text input to first audio
Processing Duration - Total synthesis time
Streaming Performance - Buffer utilization and chunk processing

Learn how to enable Metrics in your Pipeline.

Additional Notes

Local Deployment: Runs entirely on local infrastructure for privacy
Voice Cloning: Supports custom voice training with audio samples
Cross-lingual: Can synthesize multiple languages with same voice

API Reference

Services

Utilities

Frameworks

Pipeline

Overview

API Reference

XTTS Repository

Example Code

Installation

Frames

Input

Output

Language Support

Usage Example

Basic Configuration

Dynamic Configuration

Metrics

Additional Notes

API Reference

Services

Utilities

Frameworks

Pipeline

​Overview

API Reference

XTTS Repository

Example Code

​Installation

​Frames

​Input

​Output

​Language Support

​Usage Example

​Basic Configuration

​Dynamic Configuration

​Metrics

​Additional Notes

Overview

Installation

Frames

Input

Output

Language Support

Usage Example

Basic Configuration

Dynamic Configuration

Metrics

Additional Notes