Overview
Inworld AI provides high-quality text-to-speech synthesis with natural-sounding voices and real-time streaming capabilities. The service supports both streaming and non-streaming modes, making it suitable for use cases ranging from low-latency conversational AI to batch audio generation.
Streaming mode is recommended for real-time applications requiring low latency.
API Reference
Complete API documentation and method details
Inworld AI Docs
Official Inworld TTS API documentation
Example Code
Working example with Inworld TTS
Installation
To use Inworld services, no additional dependencies are required beyond the base installation.
You will also need to provide your Inworld API key, typically through the INWORLD_API_KEY environment variable.
Get your API key from Inworld Studio. Make sure
to base64-encode your API key.
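As a minimal sketch of the Base64 step, the helper below encodes a raw credential string with Python's standard library. The helper name `encode_api_key` and the `key:secret` sample value are illustrative assumptions, not part of the Inworld or Pipecat APIs. If the value copied from Inworld Studio is already Base64-encoded, encoding it a second time would produce an invalid key, so check which form you have first.

```python
import base64


def encode_api_key(raw_key: str) -> str:
    """Return the standard Base64 encoding of a raw credential string."""
    return base64.b64encode(raw_key.encode("utf-8")).decode("ascii")


def decode_api_key(encoded_key: str) -> str:
    """Inverse of encode_api_key, useful for checking what a value contains."""
    return base64.b64decode(encoded_key).decode("utf-8")


# Hypothetical sample value; a real credential comes from Inworld Studio.
sample = "key:secret"
encoded = encode_api_key(sample)
```

The encoded string is what you would then export as INWORLD_API_KEY.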
Frames
Input
- TextFrame: Text content to synthesize into speech
- TTSSpeakFrame: Text that should be spoken immediately
- TTSUpdateSettingsFrame: Runtime configuration updates
- LLMFullResponseStartFrame / LLMFullResponseEndFrame: LLM response boundaries
Output
- TTSStartedFrame: Signals start of synthesis
- TTSAudioRawFrame: Generated audio data (LINEAR16 PCM, WAV header stripped)
- TTSStoppedFrame: Signals completion of synthesis
- ErrorFrame: API or processing errors
Features
- High-Quality Voices: Natural-sounding voices including Ashley, Hades, and more
- Streaming & Non-Streaming: Unified interface supporting both real-time and batch processing
- Automatic Language Detection: No need to specify language manually - Inworld detects it from your text
- Voice Temperature Control: Accepts 0-2 (best results 0.6 to 1.0); lower values yield steadier, deterministic speech, while higher values add expressive variation.
- Model Selection: Choose inworld-tts-1 for real-time, cost-sensitive use (lowest latency); use inworld-tts-1-max (experimental) when you can trade a bit more latency for richer expressiveness and broader multilingual support.
- Professional-quality Audio Output: LINEAR16 PCM audio at up to 48kHz
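The temperature range and model trade-off described in this list can be captured in a small validation helper. This is only a sketch: `check_temperature` and `pick_model` are hypothetical names, and the only values taken from this page are the 0-2 range, the 0.6 to 1.0 recommendation, and the two model identifiers.

```python
MIN_TEMPERATURE = 0.0
MAX_TEMPERATURE = 2.0
RECOMMENDED_RANGE = (0.6, 1.0)  # "best results" range quoted on this page


def check_temperature(value: float) -> tuple[float, bool]:
    """Validate a temperature; return (value, is_in_recommended_range)."""
    if not MIN_TEMPERATURE <= value <= MAX_TEMPERATURE:
        raise ValueError(f"temperature must be within 0-2, got {value}")
    low, high = RECOMMENDED_RANGE
    return value, low <= value <= high


def pick_model(latency_sensitive: bool) -> str:
    """Pick the lower-latency model, or the experimental richer one."""
    return "inworld-tts-1" if latency_sensitive else "inworld-tts-1-max"
```

Validating before a request is sent makes a bad setting fail locally instead of surfacing later as an API error.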
Audio Markups
Inworld supports experimental audio markups for enhanced expressiveness in English.
Emotion and Delivery Style (use at beginning of text):
- Emotions: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
- Delivery Styles: [laughing], [whispering]
- Sound Effects: [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
Audio markup features are experimental and currently support English only. For
best results, use only one emotion/delivery style at the beginning of text.
For detailed usage guidelines and best practices, refer to Inworld’s
documentation on Audio Markups Best
Practices.
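To illustrate the "one emotion or delivery style at the beginning" guideline, here is a small sketch that prefixes text with a single markup and rejects text that already starts with one. The helper `with_style` is hypothetical; only the tag names come from the lists above.

```python
import re

EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
DELIVERY_STYLES = {"laughing", "whispering"}
LEADING_STYLES = EMOTIONS | DELIVERY_STYLES

# Matches a bracketed tag at the very start of the text, e.g. "[sad] ...".
_LEADING_TAG = re.compile(r"\s*\[(\w+)\]")


def with_style(text: str, style: str) -> str:
    """Prefix text with exactly one emotion or delivery-style markup."""
    if style not in LEADING_STYLES:
        raise ValueError(f"unknown emotion or delivery style: {style}")
    match = _LEADING_TAG.match(text)
    if match and match.group(1) in LEADING_STYLES:
        raise ValueError("text already starts with an emotion or delivery style")
    return f"[{style}] {text}"


line = with_style("I can't believe it!", "surprised")
```

Sound-effect tags such as [sigh] are deliberately not checked here, so text that opens with one can still receive a style prefix.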
Usage Examples
Streaming Mode (Real-time)
Perfect for conversational AI applications requiring low latency.
Non-Streaming Mode (Complete Audio)
Ideal for scenarios where you need the complete audio file before playback.
Streaming vs Non-Streaming
Mode | Best For | Use Cases |
---|---|---|
Streaming | Real-time applications | Building conversational AI, minimal latency interactions, processing text as available |
Non-Streaming | Batch processing | Longer content generation, complete audio files, batch scenarios, slightly better quality |
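The difference between the two modes can be sketched with a toy generator standing in for the TTS backend. Nothing here calls Inworld or Pipecat: `fake_synthesize` is a hypothetical placeholder that emits the text's own bytes as "audio", purely to show the two consumption patterns.

```python
from typing import Callable, Iterator


def fake_synthesize(text: str, chunk_size: int = 4) -> Iterator[bytes]:
    """Placeholder backend: yields the text's bytes in small chunks."""
    data = text.encode("utf-8")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def consume_streaming(text: str, on_chunk: Callable[[bytes], None]) -> int:
    """Streaming: hand each chunk to the consumer as soon as it arrives."""
    count = 0
    for chunk in fake_synthesize(text):
        on_chunk(chunk)
        count += 1
    return count


def consume_non_streaming(text: str) -> bytes:
    """Non-streaming: wait for everything, then return one complete buffer."""
    return b"".join(fake_synthesize(text))


played: list[bytes] = []
chunk_count = consume_streaming("hello world", played.append)
full_audio = consume_non_streaming("hello world")
```

In the streaming case playback can begin after the first chunk, which is where the latency advantage comes from; the non-streaming case yields the same bytes but only once the whole buffer exists.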
Audio Specifications
- Sample Rate Range: 8kHz - 48kHz (default comes from StartFrame)
- Bit Depth: 16-bit
- Encoding: LINEAR16 PCM (uncompressed)
- Format: WAV headers automatically stripped
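To make "WAV headers automatically stripped" concrete, the sketch below builds a small WAV container in memory and then recovers the raw LINEAR16 samples using Python's standard `wave` module. It illustrates the idea only and is not the service's actual implementation; mono audio and both helper names are assumptions.

```python
import io
import wave


def make_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)   # mono (assumption)
        writer.setsampwidth(2)   # 16-bit samples
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buffer.getvalue()


def strip_wav_header(wav_bytes: bytes) -> bytes:
    """Return only the raw PCM frames, discarding the RIFF/WAVE header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return reader.readframes(reader.getnframes())


samples = b"\x01\x00\x02\x00\x03\x00"  # three 16-bit little-endian samples
wav_bytes = make_wav(samples, 24000)
raw_pcm = strip_wav_header(wav_bytes)
```

Parsing the container rather than slicing off a fixed number of bytes keeps the sketch correct even when a file carries extra header chunks.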
Sample Rate | Quality | Use Case |
---|---|---|
16000 Hz | Basic | Voice calls, simple applications |
24000 Hz | Good | General conversational AI |
48000 Hz | High | Professional applications, music |
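When choosing a sample rate, it can help to see the raw data rate each option implies. For uncompressed 16-bit PCM this is simply sample rate times bytes per sample times channel count. The function name is illustrative, and mono output is an assumption.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> int:
    """Raw data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


rates = {hz: pcm_bytes_per_second(hz) for hz in (16000, 24000, 48000)}
# rates == {16000: 32000, 24000: 48000, 48000: 96000}
```

Going from 16000 Hz to 48000 Hz triples the bandwidth, which is worth weighing against the quality gain for voice-only applications.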
Monitoring and Metrics
- Time To First Byte (TTFB): Latency measurement from request start to first audio chunk
- Processing Time: Total duration for the complete TTS operation
- Usage Metrics: Character count of processed text for billing and analytics
Learn how to enable Metrics in your Pipeline.
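The three metrics above can be approximated by hand around any iterable of audio chunks, as in this sketch. It is not Pipecat's metrics implementation: `measure` and `usage_characters` are hypothetical helpers, and "request start" is assumed to be the moment iteration begins.

```python
import time
from typing import Callable, Iterable


def measure(chunks: Iterable[bytes], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Record time to first chunk, total processing time, and bytes received."""
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            ttfb = clock() - start  # latency until the first audio chunk
        total_bytes += len(chunk)
    return {"ttfb": ttfb, "processing_time": clock() - start, "bytes": total_bytes}


def usage_characters(text: str) -> int:
    """Character count of the text sent for synthesis."""
    return len(text)


# Deterministic demo with a fake clock instead of real timing.
ticks = iter([0.0, 0.25, 1.0])
report = measure([b"ab", b"cd"], clock=lambda: next(ticks))
```

Injecting the clock keeps the helper testable; in real use the default `time.perf_counter` applies.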
Resources
- Inworld AI Documentation
- TTS API Reference
- Inworld Studio - Voice management and API keys
- Audio Markups Best Practices - Techniques for optimal markup usage
- Pipecat Examples - Sample implementations