Overview
Fish Audio provides real-time text-to-speech synthesis through a WebSocket-based streaming API. The service offers custom voice models, prosody controls, and multiple audio formats optimized for low-latency conversational AI applications.

- API Reference - Complete API documentation and method details
- Fish Audio Docs - Official Fish Audio WebSocket API documentation
- Example Code - Working example with custom voice model
Installation
To use Fish Audio services, install the required dependencies and set your Fish Audio API key in the `FISH_API_KEY` environment variable.

Get your API key from the Fish Audio Console.
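As a quick sanity check before building a pipeline, you can verify the key is available. This is a minimal sketch that only assumes the `FISH_API_KEY` variable name above:

```python
import os

# Read the Fish Audio API key from the environment and fail early if it is missing.
fish_api_key = os.getenv("FISH_API_KEY")
if not fish_api_key:
    raise RuntimeError("FISH_API_KEY is not set; export it before starting your pipeline")
```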
Frames
Input
- `TextFrame` - Text content to synthesize into speech
- `TTSSpeakFrame` - Text that should be spoken immediately
- `TTSUpdateSettingsFrame` - Runtime configuration updates
- `LLMFullResponseStartFrame` / `LLMFullResponseEndFrame` - LLM response boundaries
Output
- `TTSStartedFrame` - Signals start of synthesis
- `TTSAudioRawFrame` - Generated audio data chunks (streaming)
- `TTSStoppedFrame` - Signals completion of synthesis
- `ErrorFrame` - API or processing errors
Sample Rate Options
Supported sample rates for different quality levels:
- 8000 Hz - Phone quality
- 16000 Hz - Standard quality
- 24000 Hz - High quality (recommended)
- 44100 Hz - CD quality
- 48000 Hz - Professional quality
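The sample rate is typically chosen when the service is constructed. A short sketch, assuming the constructor accepts a `sample_rate` keyword (confirm against the API reference):

```python
import os

from pipecat.services.fish.tts import FishAudioTTSService  # import path may vary by Pipecat version

# 24000 Hz is the recommended balance of quality and bandwidth for conversational use.
tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    sample_rate=24000,
)
```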
Language Support
Fish Audio currently supports:

| Language Code | Description | Service Code |
|---|---|---|
| `Language.EN` | English | `en` |
| `Language.JA` | Japanese | `ja` |
| `Language.ZH` | Chinese | `zh` |
Fish Audio is expanding language support. Check the official
documentation for the latest available languages.
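A minimal sketch of selecting a language, assuming the service exposes an `InputParams` class with a `language` field (verify the exact names in the API reference):

```python
from pipecat.services.fish.tts import FishAudioTTSService
from pipecat.transcriptions.language import Language

# Request Japanese synthesis; the InputParams field name is an assumption to verify
# against the service reference.
params = FishAudioTTSService.InputParams(language=Language.JA)
```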
Latency Modes
Choose the appropriate latency mode for your application:

| Mode | Description | Best For |
|---|---|---|
| `normal` | Standard latency (default) | General applications |
| `balanced` | Balanced quality/speed | Real-time conversations |
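A sketch of selecting a latency mode, assuming a `latency` field on `InputParams` that takes the mode names above:

```python
from pipecat.services.fish.tts import FishAudioTTSService

# Prefer the lower-latency mode for live conversations; the `latency` field name
# is an assumption based on the modes listed above.
params = FishAudioTTSService.InputParams(latency="balanced")
```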
Usage Example
Basic Configuration
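A minimal sketch of a typical setup. It assumes `FishAudioTTSService` is importable from `pipecat.services.fish.tts`, that the constructor accepts `api_key`, `model` (voice reference ID), `output_format`, `sample_rate`, and `params` keywords, and that `transport` and `llm` processors are created elsewhere in your application:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.fish.tts import FishAudioTTSService  # import path may vary by Pipecat version
from pipecat.transcriptions.language import Language

# Configure the service with a custom voice model (reference ID) and PCM output.
# Keyword names are assumptions; confirm them against the API reference above.
tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    model="your-voice-reference-id",  # placeholder reference ID for a custom voice
    output_format="pcm",
    sample_rate=24000,
    params=FishAudioTTSService.InputParams(language=Language.EN),
)

# Place the TTS service after the LLM so streamed text is spoken as it arrives.
# `transport` and `llm` stand in for processors created elsewhere in your app.
pipeline = Pipeline([
    transport.input(),
    llm,
    tts,
    transport.output(),
])
```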
Advanced Prosody Control
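A sketch of tuning speech characteristics at construction time. The prosody field names and units (`prosody_speed`, `prosody_volume`) are assumptions intended to illustrate the control surface; check the API reference for the exact parameters:

```python
import os

from pipecat.services.fish.tts import FishAudioTTSService

# Slow the speaking rate slightly and raise the volume.
tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    model="your-voice-reference-id",  # placeholder reference ID for a custom voice
    params=FishAudioTTSService.InputParams(
        prosody_speed=0.9,   # assumed speaking-rate multiplier (1.0 = normal)
        prosody_volume=3,    # assumed relative volume adjustment
    ),
)
```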
Dynamic Configuration
Make settings updates by pushing a `TTSUpdateSettingsFrame`:
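A sketch of updating settings at runtime, assuming `TTSUpdateSettingsFrame` takes a `settings` mapping; the specific keys shown are illustrative placeholders:

```python
from pipecat.frames.frames import TTSUpdateSettingsFrame
from pipecat.pipeline.task import PipelineTask


async def switch_voice(task: PipelineTask):
    # Push updated settings into the running pipeline; the TTS service applies them
    # to subsequent synthesis. The settings keys shown here are illustrative.
    await task.queue_frames([
        TTSUpdateSettingsFrame(settings={"model": "another-voice-reference-id"}),
    ])
```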
Metrics
The service provides comprehensive metrics:
- Time to First Byte (TTFB) - Latency from text input to first audio
- Processing Duration - Total synthesis time
- Character Usage - Text processed for billing
Learn how to enable Metrics in your Pipeline.
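A short sketch of turning metrics on when creating the task, using Pipecat's `PipelineParams` flags; `pipeline` is assumed to be the `Pipeline` built in the usage example above:

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# Enable TTFB/processing metrics and usage (character) metrics on the task.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,
        enable_usage_metrics=True,
    ),
)
```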
Additional Notes
- WebSocket Streaming: Real-time audio generation with automatic chunking
- Interruption Handling: Built-in support for conversation interruptions
- Custom Voice Models: Use your own trained voice models via reference IDs
- Audio Buffering: Efficient streaming with configurable buffer sizes
- Connection Management: Automatic reconnection on connection failures
- Format Flexibility: Multiple audio formats for different deployment scenarios
- Prosody Control: Fine-tune speech characteristics including speed and volume