Pipeline Placement
STT processors must be positioned correctly in your pipeline to receive and process audio frames:
- After `transport.input()`: STT needs `InputAudioRawFrame`s from the transport
- Before context processing: Transcriptions must be available for context aggregation
- Before LLM processing: Text must be ready for language model input
STT Service Types
Pipecat provides two types of STT services based on how they process audio:

1. STTService (Streaming)
How it works:
- Establishes a WebSocket connection to the STT provider
- Continuously streams audio for real-time transcription
- Lower latency due to the persistent connection

2. SegmentedSTTService (HTTP-based)
How it works:
- Uses local VAD (Voice Activity Detection) to chunk speech
- Sends audio segments to the STT service as WAV files
- Higher latency due to segmentation and HTTP POST requests
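To make the segmented flow concrete, here is a toy sketch of VAD-style chunking in plain Python. The "VAD" below is just an energy threshold, not Pipecat's actual Silero VAD; it only illustrates the idea of closing a segment after a run of silence and shipping each segment off separately:

```python
# Toy sketch of how a segmented STT service chunks audio locally before
# posting each segment over HTTP. Real VAD (e.g. Silero) is a neural
# model; this uses a simple energy threshold for illustration.

def segment_speech(samples, threshold=0.1, min_silence=3):
    """Split a stream of audio 'energy' values into speech segments,
    closing a segment after `min_silence` consecutive quiet samples."""
    segments, current, quiet = [], [], 0
    for s in samples:
        if abs(s) >= threshold:
            current.append(s)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= min_silence:
                segments.append(current)  # segment would be POSTed as a WAV
                current, quiet = [], 0
            else:
                current.append(s)
    if current:
        segments.append(current)
    return segments

stream = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.4, 0.3, 0.0, 0.0, 0.0]
print(len(segment_speech(stream)))  # → 2
```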
STT services are modular and can be swapped out with no additional overhead.
You can easily switch between streaming and segmented services based on your
needs.
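For example, swapping providers usually means constructing a different service object while the rest of the pipeline stays unchanged. The constructor parameters and class names below follow Pipecat's conventions but should be verified against each service's docs:

```python
import os

# Streaming (WebSocket-based) STT — class and parameter names are
# assumptions to check against the Deepgram service documentation:
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

# Segmented (HTTP/local) alternative — same pipeline position,
# same frame types in and out:
# stt = WhisperSTTService()
```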
Supported STT Services
Pipecat supports a wide range of STT providers to fit different needs and budgets:

Supported STT Services
View the complete list of supported speech-to-text providers
Deepgram
Fast, accurate streaming STT with excellent real-time performance
Speechmatics
Advanced speech recognition with strong accent and dialect handling
AssemblyAI
AI-powered transcription with speaker diarization and sentiment analysis
Gladia
High-performance STT with multilingual support and custom models
Azure Speech
Microsoft’s enterprise-grade STT service with extensive language support
Google Speech-to-Text
Reliable transcription with strong language model integration
STT Configuration
Service-Specific Configuration
Each STT service has its own customization options. Refer to the specific service documentation for details:

Individual STT Services
Explore configuration options for each supported STT provider

For example, Deepgram's streaming behavior is configured through its `LiveOptions` class.
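As an illustrative sketch, Deepgram options can be passed to Pipecat's Deepgram service via the Deepgram SDK's `LiveOptions`. The import path and option names here are assumptions to verify against the service docs and your Pipecat version:

```python
import os

from deepgram import LiveOptions
# Import path may vary by Pipecat version:
from pipecat.services.deepgram import DeepgramSTTService

stt = DeepgramSTTService(
    api_key=os.getenv("DEEPGRAM_API_KEY"),
    live_options=LiveOptions(
        model="nova-2",
        language="en-US",
        smart_format=True,     # punctuation and formatting
        interim_results=True,  # emit partial transcripts
    ),
)
```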
STTService Base Class Configuration
All STT services inherit from the STTService base class, which provides base configuration options with smart defaults:
- `audio_passthrough=True`: Allows audio frames to continue downstream to other processors (like audio recording)
- `sample_rate`: Audio sampling rate; best practice is to set `audio_in_sample_rate` in `PipelineParams` for consistency
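A minimal sketch of setting the sample rate at the pipeline level, assuming Pipecat's `PipelineParams`/`PipelineTask` API (parameter names should be checked against your version):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is assumed to be a constructed Pipeline. The sample rates
# set here propagate to the input (STT) and output (TTS) processors.
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        audio_in_sample_rate=16000,   # applied to STT and other input processors
        audio_out_sample_rate=24000,  # applied to TTS and audio output
    ),
)
```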
Setting `audio_passthrough=False` will stop audio frames from being passed downstream, which may break audio recording or other audio-dependent processors.

Pipeline-Level Audio Configuration
Instead of setting sample rates on individual services, configure them pipeline-wide. Always set audio sample rates in `PipelineParams` to avoid mismatches between different audio processors. This simplifies configuration and ensures consistent audio quality across your pipeline.

Best Practices
Enable Interim Results
When available, enable interim transcripts for a better user experience:
- Notifies context aggregation that more text is coming
- Prevents premature LLM completions
- Enables interruption detection
- Improves conversation flow
Enable Punctuation and Formatting
Use smart formatting when available:
- Professional-looking transcripts
- Better LLM comprehension
- Eliminates post-processing needs
- Improved context understanding
Use Local VAD
While many STT services provide Voice Activity Detection, use Pipecat’s local Silero VAD for better performance:
- 150-200ms faster speech detection (no network round trip)
- More responsive conversation flow
- Better interruption handling
- Reduced latency overall
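Local VAD is typically attached to the transport. A hedged sketch using Pipecat's `SileroVADAnalyzer` with a Daily transport (import paths and parameter names vary by Pipecat version, and `room_url`/`token` are placeholders you must supply):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

# room_url and token are assumed to come from your Daily room setup.
transport = DailyTransport(
    room_url,
    token,
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),  # local VAD: no network round trip
    ),
)
```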
Key Takeaways
- Pipeline placement matters - STT must come after transport input, before context processing
- Service types differ - streaming services have lower latency than segmented
- Services are modular - easily swap providers without code changes
- Best practices improve performance - use interim results, formatting, and local VAD
- Configuration affects quality - proper setup significantly impacts transcription accuracy
What’s Next
Now that you understand speech recognition, let’s explore how to manage conversation context and memory in your voice AI bot.

Context Management
Learn how to handle conversation history and context in your pipeline