FSMN-VAD
Voice Activity Detection: accurately detect speech segments in audio, essential for long-audio processing pipelines.
FSMN-VAD uses a Feedforward Sequential Memory Network to detect speech/non-speech boundaries with high precision and low latency. It supports both streaming and offline modes.
Quick Start
from funasr import AutoModel
# Standalone VAD
model = AutoModel(model="funasr/fsmn-vad", hub="hf", device="cuda")
result = model.generate(input="long_audio.wav")
# Returns speech segments: [[start_ms, end_ms], [start_ms, end_ms], ...]
print(result[0]["value"])
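Because each segment is a `[start_ms, end_ms]` pair, downstream code often needs sample indices or the total amount of speech. The following is a minimal sketch, assuming 16 kHz audio and the segment format shown above; `segments_to_samples` and `total_speech_ms` are hypothetical helper names and not part of FunASR.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz; assumed to match the model's 16 kHz input


def segments_to_samples(
    segments: List[List[int]], sample_rate: int = SAMPLE_RATE
) -> List[Tuple[int, int]]:
    """Convert [start_ms, end_ms] pairs to (start_sample, end_sample) pairs."""
    return [
        (start * sample_rate // 1000, end * sample_rate // 1000)
        for start, end in segments
    ]


def total_speech_ms(segments: List[List[int]]) -> int:
    """Sum the durations of all segments, in milliseconds."""
    return sum(end - start for start, end in segments)


# Illustrative values only, not real model output
example = [[70, 2340], [2620, 6200]]
print(segments_to_samples(example))
print(total_speech_ms(example))
```

The sample index pairs can be used to slice a 1-D waveform array directly, for example `waveform[start:end]`.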
Use as Part of ASR Pipeline
from funasr import AutoModel
# VAD automatically segments long audio before ASR
model = AutoModel(
    model="funasr/paraformer-zh",
    hub="hf",
    vad_model="funasr/fsmn-vad",
    device="cuda",
)
result = model.generate(input="meeting_2hours.wav")
print(result[0]["text"])
Features
- Streaming and offline voice activity detection
- Configurable segment length (`max_single_segment_time`)
- Low latency for real-time applications
- Works with all FunASR ASR models as a preprocessing step
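To illustrate what a cap on segment length means in practice, the sketch below splits any segment longer than a limit into consecutive pieces. It is a hypothetical post-processing helper written for illustration, not FunASR's internal logic; the name `cap_segment_length` and the assumption that the limit is expressed in milliseconds are mine.

```python
from typing import List


def cap_segment_length(segments: List[List[int]], max_ms: int) -> List[List[int]]:
    """Split any [start_ms, end_ms] segment longer than max_ms into pieces."""
    if max_ms <= 0:
        raise ValueError("max_ms must be positive")
    capped: List[List[int]] = []
    for start, end in segments:
        # Emit full-length pieces until the remainder fits within the cap
        while end - start > max_ms:
            capped.append([start, start + max_ms])
            start += max_ms
        capped.append([start, end])
    return capped


# Illustrative values only: a 75 s segment and a 10 s segment, capped at 30 s
print(cap_segment_length([[0, 75_000], [80_000, 90_000]], max_ms=30_000))
```

In FunASR examples the model-side option is typically passed as `vad_kwargs={"max_single_segment_time": ...}` to `AutoModel`; check the FunASR documentation for the current signature and default.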
Model Details
| Property | Value |
|---|---|
| Architecture | FSMN (Feedforward Sequential Memory Network) |
| Sample Rate | 16 kHz |
| Modes | Streaming + Offline |
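In streaming mode, a segment's start and end may arrive in different chunks, so a consumer has to stitch them together. The sketch below assumes a convention in which each chunk yields a list of `[start_ms, end_ms]` pairs, with `-1` standing in for a boundary that has not been observed yet (`[start, -1]` opens a segment, `[-1, end]` closes it). That convention and the helper name `merge_streaming_events` are assumptions for illustration and should be checked against the FunASR documentation.

```python
from typing import List, Optional


def merge_streaming_events(chunk_outputs: List[List[List[int]]]) -> List[List[int]]:
    """Combine per-chunk outputs into complete [start_ms, end_ms] segments."""
    segments: List[List[int]] = []
    open_start: Optional[int] = None  # start of a segment still waiting for its end
    for events in chunk_outputs:
        for start, end in events:
            if start != -1 and end != -1:
                # Segment started and ended within the same chunk
                segments.append([start, end])
            elif start != -1:
                # Start detected; the end will arrive in a later chunk
                open_start = start
            elif end != -1 and open_start is not None:
                # End detected for the currently open segment
                segments.append([open_start, end])
                open_start = None
    return segments


# Illustrative per-chunk outputs, not real model output
chunks = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 3000]], [[3500, -1]]]
print(merge_streaming_events(chunks))
```

A segment that is still open when the stream ends (the last chunk above) is not emitted; a real consumer would decide whether to close it at the final timestamp.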
Links
- GitHub: FunASR
- Docs: modelscope.github.io/FunASR