Overview
Speech input processing involves three key components:

- VAD Analyzer: Detects when users start and stop speaking
- Turn Analyzer: Determines when users have finished their turn
- Speech Events: System frames that coordinate pipeline behavior
Voice Activity Detection (VAD)
What VAD Does
VAD is responsible for detecting when a user starts and stops speaking. Pipecat uses Silero VAD, an open-source model that runs locally on the CPU with minimal overhead. Performance characteristics:

- Processes 30+ms audio chunks in less than 1ms
- Runs on a single CPU thread
- Minimal system resource impact
VAD Configuration
VAD is configured through `VADParams` in your transport setup:
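A typical setup looks like the following. This is a sketch based on Pipecat's Silero VAD integration; `room_url` and `token` are placeholders, the Daily transport is just one option, and exact module paths may differ between Pipecat versions:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url,   # placeholder: your Daily room URL
    token,      # placeholder: your Daily token
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            params=VADParams(
                confidence=0.7,  # minimum model confidence to count as speech
                start_secs=0.2,  # speech must persist this long to confirm start
                stop_secs=0.8,   # silence must persist this long to confirm stop
                min_volume=0.6,  # minimum volume to count as speech
            )
        ),
    ),
)
```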
Key Parameters
`start_secs` (default: 0.2)
- How long a user must speak before VAD confirms speech has started
- Lower values = more responsive, but may trigger on brief sounds
- Higher values = less sensitive, but may miss quick utterances like "yes", "no", or "ok"
`stop_secs` (default: 0.8)
- How much silence must be detected before confirming speech has stopped
- Critical for turn-taking behavior
- Modified automatically when using turn detection
`confidence` and `min_volume`
- Generally work well with defaults
- Only adjust after extensive testing with your specific audio conditions
Changing confidence and min_volume requires careful profiling to ensure
optimal performance across different audio environments and use cases.
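The debounce roles of `start_secs` and `stop_secs` can be made concrete with a small self-contained simulation. This is illustrative only, not Pipecat's implementation; the class name, chunk size, and threshold values are invented for the example:

```python
class DebouncedVAD:
    """Toy debounce: raw per-chunk speech flags -> confirmed speaking state.

    Speech must persist for start_secs before we confirm the user started,
    and silence must persist for stop_secs before we confirm they stopped.
    """

    def __init__(self, start_secs=0.2, stop_secs=0.8, chunk_secs=0.03):
        self.start_chunks = round(start_secs / chunk_secs)
        self.stop_chunks = round(stop_secs / chunk_secs)
        self.speaking = False
        self._run = 0  # consecutive chunks contradicting the current state

    def process(self, is_speech):
        """Feed one chunk's raw VAD flag; return an event name or None."""
        if bool(is_speech) != self.speaking:
            self._run += 1
        else:
            self._run = 0
        if not self.speaking and self._run >= self.start_chunks:
            self.speaking, self._run = True, 0
            return "user_started_speaking"
        if self.speaking and self._run >= self.stop_chunks:
            self.speaking, self._run = False, 0
            return "user_stopped_speaking"
        return None


# Two speech chunks confirm a start; three silent chunks confirm a stop.
vad = DebouncedVAD(start_secs=0.06, stop_secs=0.09, chunk_secs=0.03)
events = [vad.process(flag) for flag in [1, 1, 1, 0, 0, 0, 0]]
print(events)
```

Raising `start_secs` or `stop_secs` simply raises the number of consecutive chunks needed before the state flips, which is why larger values are less jittery but slower to react.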
Turn Detection
Beyond Simple VAD
While VAD detects speech vs. silence, it can’t understand linguistic context. Humans use grammar, tone, pace, and semantic cues to determine conversation turns. Pipecat’s turn detection brings this sophistication to voice AI.
smart-turn Model
Pipecat integrates with the smart-turn model, an open-source native audio turn detection model:

- Support for 14 languages
- Community-driven development
- BSD 2-clause license (truly open)
VAD + Turn Detection Integration
When using turn detection, VAD and the turn analyzer work together:

- VAD detects speech segments with a low `stop_secs` (recommended: 0.2)
- The turn model analyzes audio to determine if the turn is complete or incomplete
- VAD behavior adjusts based on turn model results:
  - Complete: Normal VAD stop behavior
  - Incomplete: Extends waiting time (default: 3.0 seconds)
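The coordination above boils down to choosing how long to sit in silence before ending the user's turn. A minimal sketch of that decision, with values mirroring the defaults described here (not Pipecat's actual implementation; the function name is invented):

```python
def stop_wait_secs(turn_state, vad_stop_secs=0.2, incomplete_timeout_secs=3.0):
    """Pick the silence duration to wait before ending the user's turn.

    turn_state: "complete" or "incomplete", as classified by a turn model.
    With turn detection enabled, VAD runs with a short stop_secs (0.2 s);
    if the model judges the turn incomplete, the pipeline keeps listening
    for up to incomplete_timeout_secs before ending the turn anyway.
    """
    if turn_state == "complete":
        return vad_stop_secs            # normal VAD stop behavior
    return incomplete_timeout_secs      # extended wait: user likely continues


print(stop_wait_secs("complete"))
print(stop_wait_secs("incomplete"))
```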
VAD Documentation
Complete VAD configuration reference
Turn Detection Guide
Detailed turn detection implementation guide
Speech Events & Pipeline Coordination
System Frames for Speech Events
When VAD detects speech activity, the transport emits system frames that coordinate pipeline behavior.

When speech starts:

- `UserStartedSpeakingFrame`: Informs processors that the user began speaking
- `StartInterruptionFrame`: Triggers interruption handling (if enabled)

When speech stops:

- `UserStoppedSpeakingFrame`: Signals the end of user input
- `StopInterruptionFrame`: Resumes normal processing
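The mapping from VAD events to emitted frames can be sketched with stand-in classes. These toy classes only share names with the Pipecat frames listed above; they are not the real frame types:

```python
# Toy stand-ins for the system frames named above (not Pipecat's classes).
class UserStartedSpeakingFrame: pass
class StartInterruptionFrame: pass
class UserStoppedSpeakingFrame: pass
class StopInterruptionFrame: pass


def frames_for(event, interruptions_enabled=True):
    """Map a VAD event name to the system frames a transport would emit."""
    if event == "user_started_speaking":
        frames = [UserStartedSpeakingFrame()]
        if interruptions_enabled:
            frames.append(StartInterruptionFrame())
        return frames
    if event == "user_stopped_speaking":
        frames = [UserStoppedSpeakingFrame()]
        if interruptions_enabled:
            frames.append(StopInterruptionFrame())
        return frames
    return []


emitted = frames_for("user_started_speaking")
print([type(f).__name__ for f in emitted])
```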
Interruption Handling
Interruptions are a critical feature for natural conversations:

1. User starts speaking → `StartInterruptionFrame` emitted
2. System frame processed immediately (bypasses normal queues)
3. Current processors stop and clear their queues
4. Pipeline resets, ready for new user input
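The queue-clearing step of this flow can be illustrated with a toy processor (a simplified sketch, not Pipecat's `FrameProcessor`; the class and method names are invented):

```python
from collections import deque


class ToyProcessor:
    """Minimal processor with a pending-frame queue, to show interruption."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def enqueue(self, frame):
        self.queue.append(frame)

    def handle_interruption(self):
        # System frames bypass the queue: pending work is dropped immediately
        # so the pipeline can reset for the user's new input.
        self.queue.clear()


pipeline = [ToyProcessor("llm"), ToyProcessor("tts")]
pipeline[0].enqueue("half-generated response")
pipeline[1].enqueue("queued audio chunk")

# User starts speaking: the interruption propagates to every processor.
for p in pipeline:
    p.handle_interruption()

print([len(p.queue) for p in pipeline])
```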
Best Practices
Optimal Configuration
For most voice AI use cases, the defaults shown above are a good starting point; adjust only after testing with real audio.

Performance Considerations
- Use local VAD: 150-200ms faster than remote VAD services
- Tune for your use case: Test with real audio conditions
- Monitor CPU usage: VAD adds minimal overhead but monitor in production
- Consider turn detection: Improves conversation quality but adds complexity
Key Takeaways
- VAD detects speech activity but turn detection understands conversation context
- Configuration affects user experience - tune parameters for your specific use case
- System frames coordinate behavior - enable interruptions and natural turn-taking
- Local processing is faster - Silero VAD provides low-latency speech detection
- Turn detection improves quality - but requires careful VAD configuration
What’s Next
Now that you understand how speech input is detected and processed, let’s explore how that audio gets converted to text through speech recognition.

Speech to Text
Learn how to configure speech recognition in your voice AI pipeline