Overview
CartesiaSTTService provides real-time speech recognition using Cartesia’s WebSocket API with the ink-whisper model, supporting streaming transcription with both interim and final results.
- API Reference: Complete API documentation and method details
- Cartesia Docs: Official Cartesia STT documentation and features
- Example Code: Working example with transcription logging
Installation
To use Cartesia services, install the required dependency.
You will also need to set your API key in the CARTESIA_API_KEY environment variable.
Get your API key from Cartesia.
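A minimal sketch of reading the key at startup, assuming it is stored in the CARTESIA_API_KEY environment variable. The helper name `load_cartesia_api_key` is hypothetical and not part of any library; it only shows one way to fail early with a clear message when the key is missing.

```python
import os


def load_cartesia_api_key(env=None):
    """Return the Cartesia API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise in tests. (Hypothetical helper, not a library function.)
    """
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

Calling the helper once at startup means a missing key surfaces before any WebSocket connection is attempted.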
Frames
Input
- InputAudioRawFrame - Raw PCM audio data (16-bit, 16kHz, mono)
- UserStartedSpeakingFrame - Triggers metrics collection
- UserStoppedSpeakingFrame - Sends finalize command to flush session
- STTUpdateSettingsFrame - Runtime transcription configuration updates
- STTMuteFrame - Mute audio input for transcription
Output
- InterimTranscriptionFrame - Real-time transcription updates
- TranscriptionFrame - Final transcription results
- ErrorFrame - Connection or processing errors
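The difference between interim and final output can be sketched with a small buffer. This is an illustration only: `TranscriptBuffer` is a hypothetical stand-in that works on plain strings, not on the actual InterimTranscriptionFrame or TranscriptionFrame classes, and it assumes each interim result replaces the previous one until a final result arrives.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Tracks the latest interim text and the list of finalized segments."""

    finals: list = field(default_factory=list)
    interim: str = ""

    def on_interim(self, text: str) -> None:
        # Interim results are provisional: each one replaces the previous.
        self.interim = text

    def on_final(self, text: str) -> None:
        # A final result supersedes any pending interim text.
        self.finals.append(text)
        self.interim = ""

    def display_text(self) -> str:
        # Finalized segments first, then the current provisional text, if any.
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

In a real handler, `on_interim` would be called with the text of each interim frame and `on_final` with the text of each final frame, so a live caption shows committed text followed by the current guess.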
Models
Cartesia currently offers one primary STT model:

Model | Description | Best For |
---|---|---|
ink-whisper | Cartesia’s optimized Whisper implementation | General-purpose real-time transcription |
Language Support
Cartesia STT supports multiple languages through standard language codes:

Language Code | Description | Service Codes |
---|---|---|
Language.EN | English (US) | en |
Language.ES | Spanish | es |
Language.FR | French | fr |
Language.DE | German | de |
Language.IT | Italian | it |
Language.PT | Portuguese | pt |
Language.NL | Dutch | nl |
Language.PL | Polish | pl |
Language.RU | Russian | ru |
Language.JA | Japanese | ja |
Language.KO | Korean | ko |
Language.ZH | Chinese | zh |
Language support may vary. Check Cartesia’s documentation for the most current language list.
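The mapping in the table can be expressed as a small lookup. This is a sketch: the dictionary is copied from the table, while `to_service_code`, its handling of regional variants such as `pt-BR`, and the fallback to `en` for unlisted languages are assumptions made for illustration, not documented service behavior.

```python
# Service codes copied from the language table in this section.
SERVICE_CODES = {
    "EN": "en", "ES": "es", "FR": "fr", "DE": "de",
    "IT": "it", "PT": "pt", "NL": "nl", "PL": "pl",
    "RU": "ru", "JA": "ja", "KO": "ko", "ZH": "zh",
}


def to_service_code(language: str, default: str = "en") -> str:
    """Map names like 'Language.ES', 'es', or 'pt-BR' to a service code.

    Hypothetical helper: the regional suffix is dropped and unknown
    languages fall back to `default`.
    """
    name = language.split(".")[-1]                    # drop a "Language." prefix
    base = name.replace("-", "_").split("_")[0].upper()  # "pt-BR" -> "PT"
    return SERVICE_CODES.get(base, default)
```

Falling back silently is convenient in a demo; a production service might prefer to raise an error so that an unsupported language is noticed.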
Usage Example
Basic Configuration
Initialize the CartesiaSTTService and use it in a pipeline.
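A minimal sketch of what this setup might look like. The import paths `pipecat.services.cartesia.stt` and `pipecat.pipeline.pipeline`, the `api_key` keyword, and the `transport.input()` / `transport.output()` calls are assumptions about the installed framework version; the helper names `build_stt_service` and `build_pipeline` are hypothetical. Check the API reference for the exact signatures.

```python
def build_stt_service(api_key: str):
    """Create the STT service; raises ValueError when no key is supplied."""
    if not api_key:
        raise ValueError("A Cartesia API key is required")
    # Lazy import: assumes the framework is installed with Cartesia support
    # and that this module path matches the installed version.
    from pipecat.services.cartesia.stt import CartesiaSTTService

    return CartesiaSTTService(api_key=api_key)


def build_pipeline(transport, stt, downstream=()):
    """Place the STT service directly after the transport's audio input.

    `downstream` holds whatever processors consume transcriptions
    (for example an LLM and a TTS service); it is left to the caller.
    """
    from pipecat.pipeline.pipeline import Pipeline

    return Pipeline([transport.input(), stt, *downstream, transport.output()])
```

The key point is ordering: the STT service sits immediately after the audio input so that later processors receive transcription frames rather than raw audio.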
Dynamic Configuration
Update settings at runtime by pushing an STTUpdateSettingsFrame to the CartesiaSTTService.
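A hedged sketch of a runtime language switch. The `settings` keyword on STTUpdateSettingsFrame, the `"language"` key, the `pipecat.frames.frames` import path, and the `queue_frame` method on the running task are assumptions about the installed version; `language_settings` and `switch_language` are hypothetical helpers. The expected type of the language value (enum member or string code) should be confirmed against the API reference.

```python
def language_settings(language) -> dict:
    """Build the settings payload for a language change.

    The "language" key is an assumption; confirm the accepted keys
    and value types in the API reference.
    """
    return {"language": language}


async def switch_language(task, language) -> None:
    """Queue a settings update on a running pipeline task (sketch)."""
    # Lazy import so this module loads even where the framework is absent.
    from pipecat.frames.frames import STTUpdateSettingsFrame

    frame = STTUpdateSettingsFrame(settings=language_settings(language))
    await task.queue_frame(frame)
```

Because the update travels through the pipeline as a frame, it is applied in order with the audio around it rather than interrupting a transcription in progress.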
Live Options Configuration
Metrics
The service provides the following metrics:
- Time to First Byte (TTFB) - Latency from audio input to first transcription
- Processing Duration - Total time spent processing audio
Learn how to enable Metrics in your Pipeline.
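To make the TTFB definition concrete, here is a small sketch of how such a latency could be measured. `TTFBTimer` is a hypothetical class written for illustration; it is not how the service computes its metrics internally, only an instance of "time from first audio to first transcription".

```python
import time


class TTFBTimer:
    """Measures latency from the first audio chunk to the first transcript."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock, useful for testing
        self._start = None
        self.ttfb = None

    def on_audio(self) -> None:
        # Start timing at the first audio chunk only.
        if self._start is None:
            self._start = self._clock()

    def on_transcript(self) -> None:
        # Record only the first transcript that follows the audio start.
        if self._start is not None and self.ttfb is None:
            self.ttfb = self._clock() - self._start
```

A monotonic clock is used because wall-clock time can jump during a session, which would distort short latency measurements.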
Additional Notes
- Audio Format: Expects PCM S16LE format at 16kHz sample rate by default
- Session Management: Each connection represents a transcription session that can be finalized
- Interim Results: Provides real-time interim transcriptions before final results
- Language Detection: Automatic language detection available in transcription responses
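The default audio format noted above (PCM S16LE, 16kHz, mono) can be illustrated with the standard library. This sketch converts floating-point samples to little-endian signed 16-bit bytes and computes the size of a chunk; the function names and the clipping behavior are choices made for illustration.

```python
import struct

SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit mono


def floats_to_pcm_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Values outside the range are clipped before scaling.
    """
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_bytes(duration_ms: int) -> int:
    """Bytes of mono PCM S16LE audio covering duration_ms at SAMPLE_RATE."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * duration_ms // 1000
```

For example, a 20 ms chunk at these settings is 640 bytes, which is a useful sanity check when audio arrives from a source with a different sample rate or bit depth.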