Inworld

Overview

Inworld AI provides high-quality text-to-speech synthesis with natural-sounding voices and real-time streaming capabilities. The service supports both streaming and non-streaming modes, making it suitable for various use cases from low-latency conversational AI to batch audio generation.

Streaming mode is recommended for real-time applications requiring low latency.

API Reference

Complete API documentation and method details

Inworld AI Docs

Official Inworld TTS API documentation

Example Code

Working example with Inworld TTS

Installation

To use Inworld services, no additional dependencies are required beyond the base installation:

pip install "pipecat-ai"

You’ll also need to set up your Inworld API key as an environment variable: INWORLD_API_KEY.

Get your API key from Inworld Studio. The key must be Base64-encoded before you pass it to the service.
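Because a missing or unencoded key usually only surfaces later as an API error, it can help to check the environment variable at startup. The following is a minimal sketch, not part of Pipecat or Inworld: the helper names `looks_base64` and `load_inworld_api_key` are hypothetical, and the check only confirms that the value is syntactically valid Base64.

```python
import base64
import binascii
import os


def looks_base64(value: str) -> bool:
    """Return True if value is a non-empty, strictly valid Base64 string."""
    if not value:
        return False
    try:
        # validate=True rejects characters outside the Base64 alphabet
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def load_inworld_api_key(env_var: str = "INWORLD_API_KEY") -> str:
    """Hypothetical helper: read the key and fail early if it is unusable."""
    key = os.getenv(env_var, "").strip()
    if not looks_base64(key):
        raise RuntimeError(f"{env_var} is missing or not Base64-encoded")
    return key
```

Note that this is a syntax check only: a plain alphanumeric string whose length is a multiple of four also passes, so it catches obvious mistakes rather than proving the key is correct.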

Frames

Input

TextFrame - Text content to synthesize into speech
TTSSpeakFrame - Text that should be spoken immediately
TTSUpdateSettingsFrame - Runtime configuration updates
LLMFullResponseStartFrame / LLMFullResponseEndFrame - LLM response boundaries

Output

TTSStartedFrame - Signals start of synthesis
TTSAudioRawFrame - Generated audio data (LINEAR16 PCM, WAV header stripped)
TTSStoppedFrame - Signals completion of synthesis
ErrorFrame - API or processing errors

Features

High-Quality Voices: Natural-sounding voices including Ashley, Hades, and more
Streaming & Non-Streaming: Unified interface supporting both real-time and batch processing
Automatic Language Detection: No need to specify language manually - Inworld detects it from your text
Voice Temperature Control: Accepts 0-2 (best results 0.6 to 1.0); lower values yield steadier, deterministic speech, while higher values add expressive variation.
Model Selection: Choose inworld-tts-1 for real-time, cost-sensitive use (lowest latency); use inworld-tts-1-max (experimental) when you can trade a bit more latency for richer expressiveness and broader multilingual support.
Professional-quality Audio Output: LINEAR16 PCM audio at up to 48kHz
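The temperature range and model trade-off listed above can be expressed as a small pre-flight check. This is an illustrative sketch only: `check_temperature` and `pick_model` are hypothetical helpers, not part of the Pipecat API, and the only values taken from this page are the 0 to 2 range, the 0.6 to 1.0 recommendation, and the two model names.

```python
def check_temperature(value: float) -> str:
    """Classify a temperature against the ranges documented above."""
    if not 0.0 <= value <= 2.0:
        raise ValueError(f"temperature must be between 0 and 2, got {value}")
    if 0.6 <= value <= 1.0:
        return "recommended"  # range this page describes as giving best results
    return "allowed"  # accepted, but outside the recommended band


def pick_model(prefer_low_latency: bool) -> str:
    """Choose a model name based on the latency/expressiveness trade-off."""
    return "inworld-tts-1" if prefer_low_latency else "inworld-tts-1-max"
```

The values returned by these helpers could then be passed as `model` and `temperature` when constructing the service.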

Audio Markups

Inworld supports experimental audio markups for enhanced expressiveness in English.

Emotion and Delivery Style (use at beginning of text):

Emotions: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
Delivery Styles: [laughing], [whispering]

Non-verbal Vocalizations (place anywhere in text):

Sound Effects: [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]

Audio markup features are experimental and currently support English only. For best results, use only one emotion/delivery style at the beginning of text. For detailed usage guidelines and best practices, refer to Inworld’s documentation on Audio Markups Best Practices.
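The placement rules above (one emotion or delivery style at the start, vocalizations anywhere) can be enforced when building the text to synthesize. A minimal sketch follows; the helper names are hypothetical, the tag lists are copied from this page, and the single space after a leading tag is an assumption rather than a documented requirement.

```python
EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
DELIVERY_STYLES = {"laughing", "whispering"}
VOCALIZATIONS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}


def with_style(text: str, style: str) -> str:
    """Prefix text with exactly one emotion or delivery-style tag."""
    if style not in EMOTIONS | DELIVERY_STYLES:
        raise ValueError(f"unknown emotion or delivery style: {style}")
    return f"[{style}] {text}"


def vocalization(name: str) -> str:
    """Return a non-verbal vocalization tag that can be placed anywhere in text."""
    if name not in VOCALIZATIONS:
        raise ValueError(f"unknown vocalization: {name}")
    return f"[{name}]"


# Example: one leading style tag, one vocalization in the middle of the text
line = with_style(f"I did not expect that. {vocalization('sigh')} Let's start over.", "surprised")
```

The resulting string can be sent to the service like any other text, for example in a TTSSpeakFrame.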

Usage Examples

Streaming Mode (Real-time)

Perfect for conversational AI applications requiring low latency:

import asyncio
import aiohttp
import os
from pipecat.services.inworld.tts import InworldTTSService

async def main():
    async with aiohttp.ClientSession() as session:
        tts = InworldTTSService(
            api_key=os.getenv("INWORLD_API_KEY"),
            aiohttp_session=session,
            voice_id="Ashley",
            model="inworld-tts-1",
            streaming=True,  # Use streaming mode for real-time audio
            params=InworldTTSService.InputParams(
                temperature=0.8,
            ),
        )

        # Use in your pipeline
        # pipeline = Pipeline([...other_processors..., tts, ...])

asyncio.run(main())

Non-Streaming Mode (Complete Audio)

Ideal for scenarios where you need the complete audio file before playback:

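# Assumes `import os` and an open aiohttp.ClientSession named `session`,
# set up as in the streaming example above.
tts = InworldTTSService(
    api_key=os.getenv("INWORLD_API_KEY"),
    aiohttp_session=session,
    voice_id="Hades",
    model="inworld-tts-1-max",  # Experimental model: richer expressiveness, more latency
    streaming=False,  # Generate the complete audio before returning it
    params=InworldTTSService.InputParams(
        temperature=1.2,  # Above the 0.6-1.0 sweet spot: more expressive variation
    ),
)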

Streaming vs Non-Streaming

Mode	Best For	Use Cases
Streaming	Real-time applications	Building conversational AI, minimal latency interactions, processing text as available
Non-Streaming	Batch processing	Longer content generation, complete audio files, batch scenarios, slightly better quality

Audio Specifications

Sample Rate Range: 8kHz - 48kHz (default comes from StartFrame)
Bit Depth: 16-bit
Encoding: LINEAR16 PCM (uncompressed)
Format: WAV headers automatically stripped

Sample Rate	Quality	Use Case
16000 Hz	Basic	Voice calls, simple applications
24000 Hz	Good	General conversational AI
48000 Hz	High	Professional applications, music
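Since the service emits raw 16-bit PCM with the WAV header stripped, saving or inspecting the audio outside a pipeline means adding a header back. The sketch below uses only Python's standard `wave` module; the function names are hypothetical, and mono audio is an assumption, since this page does not state the channel count.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # LINEAR16 means 16-bit samples


def pcm_duration_seconds(pcm: bytes, sample_rate: int, channels: int = 1) -> float:
    """Duration of raw LINEAR16 audio; channels=1 (mono) is an assumption."""
    return len(pcm) / (sample_rate * BYTES_PER_SAMPLE * channels)


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap headerless PCM in a WAV container so standard players can open it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The sample rate passed here has to match the rate the pipeline was configured with; a mismatch produces audio that plays too fast or too slow.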

Monitoring and Metrics

Time To First Byte (TTFB): Latency measurement from request start to first audio chunk
Processing Time: Total duration for the complete TTS operation
Usage Metrics: Character count of processed text for billing and analytics

Learn how to enable Metrics in your Pipeline.
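To make the TTFB and processing-time definitions concrete, the sketch below times a generic stream of audio chunks. It is an illustration only and does not use Pipecat's metrics API: `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream stands in for a real TTS response.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def measure_stream(
    chunks: AsyncIterator[bytes],
) -> Tuple[Optional[float], float, int]:
    """Return (ttfb, processing_time, total_bytes) for an audio chunk stream."""
    start = time.perf_counter()
    ttfb: Optional[float] = None
    total_bytes = 0
    async for chunk in chunks:
        if ttfb is None:
            # Time from request start to the first audio chunk
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    # Total duration of the complete operation
    processing_time = time.perf_counter() - start
    return ttfb, processing_time, total_bytes


async def fake_tts_stream() -> AsyncIterator[bytes]:
    """Stand-in for a TTS response: three chunks of 160 silent samples each."""
    for _ in range(3):
        await asyncio.sleep(0.01)
        yield b"\x00\x00" * 160
```

Calling `asyncio.run(measure_stream(fake_tts_stream()))` returns the three measurements; in streaming mode the first value is the one that matters most for perceived latency.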

Resources

Inworld AI Documentation
TTS API Reference
Inworld Studio - Voice management and API keys
Audio Markups Best Practices - Techniques for optimal markup usage
Pipecat Examples - Sample implementations
