Overview
Speech input processing involves three key components:

- VAD Analyzer: Detects when users start and stop speaking
- Turn Analyzer: Determines when users have finished their turn
- Speech Events: System frames that coordinate pipeline behavior
Voice Activity Detection (VAD)
What VAD Does
VAD is responsible for detecting when a user starts and stops speaking. Pipecat uses Silero VAD, an open-source model that runs locally on the CPU with minimal overhead. Performance characteristics:

- Processes 30+ms audio chunks in less than 1ms
- Runs on a single CPU thread
- Minimal system resource impact
VAD Configuration
VAD is configured through `VADParams` in your transport setup:
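A typical setup looks like the following. This is a sketch based on Pipecat's Silero VAD integration; `room_url` and `token` are placeholders, the Daily transport is just one option, and exact module paths may differ between Pipecat versions:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url,   # placeholder: your Daily room URL
    token,      # placeholder: your Daily token
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            params=VADParams(
                confidence=0.7,  # minimum model confidence to count as speech
                start_secs=0.2,  # speech must persist this long to confirm start
                stop_secs=0.8,   # silence must persist this long to confirm stop
                min_volume=0.6,  # minimum volume to count as speech
            )
        ),
    ),
)
```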
Key Parameters
`start_secs` (default: 0.2)
- How long a user must speak before VAD confirms speech has started
- Lower values = more responsive, but may trigger on brief sounds
- Higher values = less sensitive, but may miss quick utterances like "yes", "no", or "ok"
`stop_secs` (default: 0.8)
- How much silence must be detected before confirming speech has stopped
- Critical for turn-taking behavior
- Modified automatically when using turn detection
`confidence` and `min_volume`
- Generally work well with defaults
- Only adjust after extensive testing with your specific audio conditions
Changing confidence and min_volume requires careful profiling to ensure
optimal performance across different audio environments and use cases.
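The debounce roles of `start_secs` and `stop_secs` can be made concrete with a small self-contained simulation. This is illustrative only, not Pipecat's implementation; the class name, chunk size, and threshold values are invented for the example:

```python
class DebouncedVAD:
    """Toy debounce: raw per-chunk speech flags -> confirmed speaking state.

    Speech must persist for start_secs before we confirm the user started,
    and silence must persist for stop_secs before we confirm they stopped.
    """

    def __init__(self, start_secs=0.2, stop_secs=0.8, chunk_secs=0.03):
        self.start_chunks = round(start_secs / chunk_secs)
        self.stop_chunks = round(stop_secs / chunk_secs)
        self.speaking = False
        self._run = 0  # consecutive chunks contradicting the current state

    def process(self, is_speech):
        """Feed one chunk's raw VAD flag; return an event name or None."""
        if bool(is_speech) != self.speaking:
            self._run += 1
        else:
            self._run = 0
        if not self.speaking and self._run >= self.start_chunks:
            self.speaking, self._run = True, 0
            return "user_started_speaking"
        if self.speaking and self._run >= self.stop_chunks:
            self.speaking, self._run = False, 0
            return "user_stopped_speaking"
        return None


# Two speech chunks confirm a start; three silent chunks confirm a stop.
vad = DebouncedVAD(start_secs=0.06, stop_secs=0.09, chunk_secs=0.03)
events = [vad.process(flag) for flag in [1, 1, 1, 0, 0, 0, 0]]
print(events)
```

Raising `start_secs` or `stop_secs` simply raises the number of consecutive chunks needed before the state flips, which is why larger values are less jittery but slower to react.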
Turn Detection
Beyond Simple VAD
While VAD detects speech vs. silence, it can’t understand linguistic context. Humans use grammar, tone, pace, and semantic cues to determine conversation turns. Pipecat’s turn detection brings this sophistication to voice AI.
smart-turn Model
Pipecat integrates with the smart-turn model, an open-source native audio turn detection model:

- Support for 14 languages
- Community-driven development
- BSD 2-clause license (truly open)
VAD + Turn Detection Integration
When using turn detection, VAD and the turn analyzer work together:

- VAD detects speech segments with a low `stop_secs` (recommended: 0.2)
- The turn model analyzes audio to determine if the turn is complete or incomplete
- VAD behavior adjusts based on turn model results:
  - Complete: Normal VAD stop behavior
  - Incomplete: Extends waiting time (default: 3.0 seconds)
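The coordination above boils down to choosing how long to sit in silence before ending the user's turn. A minimal sketch of that decision, with values mirroring the defaults described here (not Pipecat's actual implementation; the function name is invented):

```python
def stop_wait_secs(turn_state, vad_stop_secs=0.2, incomplete_timeout_secs=3.0):
    """Pick the silence duration to wait before ending the user's turn.

    turn_state: "complete" or "incomplete", as classified by a turn model.
    With turn detection enabled, VAD runs with a short stop_secs (0.2 s);
    if the model judges the turn incomplete, the pipeline keeps listening
    for up to incomplete_timeout_secs before ending the turn anyway.
    """
    if turn_state == "complete":
        return vad_stop_secs            # normal VAD stop behavior
    return incomplete_timeout_secs      # extended wait: user likely continues


print(stop_wait_secs("complete"))
print(stop_wait_secs("incomplete"))
```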
VAD Documentation
Complete VAD configuration reference
Turn Detection Guide
Detailed turn detection implementation guide
Speech Events & Pipeline Coordination
System Frames for Speech Events
When VAD detects speech activity, the transport emits system frames that coordinate pipeline behavior.

When speech starts:

- `UserStartedSpeakingFrame`: Informs processors that the user began speaking
- `StartInterruptionFrame`: Triggers interruption handling (if enabled)

When speech stops:

- `UserStoppedSpeakingFrame`: Signals the end of user input
- `StopInterruptionFrame`: Resumes normal processing
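The mapping from VAD events to emitted frames can be sketched with stand-in classes. These toy classes only share names with the Pipecat frames listed above; they are not the real frame types:

```python
# Toy stand-ins for the system frames named above (not Pipecat's classes).
class UserStartedSpeakingFrame: pass
class StartInterruptionFrame: pass
class UserStoppedSpeakingFrame: pass
class StopInterruptionFrame: pass


def frames_for(event, interruptions_enabled=True):
    """Map a VAD event name to the system frames a transport would emit."""
    if event == "user_started_speaking":
        frames = [UserStartedSpeakingFrame()]
        if interruptions_enabled:
            frames.append(StartInterruptionFrame())
        return frames
    if event == "user_stopped_speaking":
        frames = [UserStoppedSpeakingFrame()]
        if interruptions_enabled:
            frames.append(StopInterruptionFrame())
        return frames
    return []


emitted = frames_for("user_started_speaking")
print([type(f).__name__ for f in emitted])
```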
Interruption Handling
Interruptions are a critical feature for natural conversations:

1. User starts speaking → `StartInterruptionFrame` emitted
2. System frame processed immediately (bypasses normal queues)
3. Current processors stop and clear their queues
4. Pipeline resets, ready for new user input
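The queue-clearing step of this flow can be illustrated with a toy processor (a simplified sketch, not Pipecat's `FrameProcessor`; the class and method names are invented):

```python
from collections import deque


class ToyProcessor:
    """Minimal processor with a pending-frame queue, to show interruption."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def enqueue(self, frame):
        self.queue.append(frame)

    def handle_interruption(self):
        # System frames bypass the queue: pending work is dropped immediately
        # so the pipeline can reset for the user's new input.
        self.queue.clear()


pipeline = [ToyProcessor("llm"), ToyProcessor("tts")]
pipeline[0].enqueue("half-generated response")
pipeline[1].enqueue("queued audio chunk")

# User starts speaking: the interruption propagates to every processor.
for p in pipeline:
    p.handle_interruption()

print([len(p.queue) for p in pipeline])
```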
Best Practices
Optimal Configuration
For most voice AI use cases, the defaults shown above are a good starting point; adjust only after testing with real audio.

Performance Considerations
- Use local VAD: 150-200ms faster than remote VAD services
- Tune for your use case: Test with real audio conditions
- Monitor CPU usage: VAD adds minimal overhead but monitor in production
- Consider turn detection: Improves conversation quality but adds complexity
Key Takeaways
- VAD detects speech activity but turn detection understands conversation context
- Configuration affects user experience - tune parameters for your specific use case
- System frames coordinate behavior - enable interruptions and natural turn-taking
- Local processing is faster - Silero VAD provides low-latency speech detection
- Turn detection improves quality - but requires careful VAD configuration
What’s Next
Now that you understand how speech input is detected and processed, let’s explore how that audio gets converted to text through speech recognition.

Speech to Text
Learn how to configure speech recognition in your voice AI pipeline