FSMN-VAD

Voice Activity Detection (VAD): accurately detects speech segments in audio, an essential step in long-audio processing pipelines.

FSMN-VAD uses a Feedforward Sequential Memory Network to detect speech/non-speech boundaries with high precision and low latency. It supports both streaming and offline modes.

Quick Start

from funasr import AutoModel

# Standalone VAD
model = AutoModel(model="funasr/fsmn-vad", hub="hf", device="cuda")
result = model.generate(input="long_audio.wav")
# Returns speech segments: [[start_ms, end_ms], [start_ms, end_ms], ...]
print(result[0]["value"])
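The segment list is plain Python data, so it can be summarized without further FunASR calls. The sketch below assumes the [[start_ms, end_ms], ...] shape described in the comment above; the helper names and the sample values are hypothetical and only illustrate the arithmetic.

```python
from typing import List, Sequence


def segment_durations_ms(segments: Sequence[Sequence[int]]) -> List[int]:
    """Return the duration in milliseconds of each [start_ms, end_ms] pair."""
    return [end_ms - start_ms for start_ms, end_ms in segments]


def speech_ratio(segments: Sequence[Sequence[int]], audio_ms: int) -> float:
    """Return the fraction of the recording that was marked as speech."""
    if audio_ms <= 0:
        raise ValueError("audio_ms must be positive")
    return sum(segment_durations_ms(segments)) / audio_ms


# Hypothetical values standing in for result[0]["value"] on a 10-second file.
segments = [[70, 2340], [2620, 6200]]
durations = segment_durations_ms(segments)
total_speech_ms = sum(durations)
ratio = speech_ratio(segments, audio_ms=10_000)
print(durations, total_speech_ms, ratio)
```

A low speech ratio on a long recording is a quick sign that VAD is removing a lot of silence before any downstream processing.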

Use as Part of ASR Pipeline

from funasr import AutoModel

# VAD automatically segments long audio before ASR
model = AutoModel(
    model="funasr/paraformer-zh",
    hub="hf",
    vad_model="funasr/fsmn-vad",
    device="cuda",
)
result = model.generate(input="meeting_2hours.wav")
print(result[0]["text"])
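Conceptually, segmenting before ASR means cutting the waveform at the VAD boundaries and recognizing each piece on its own. The following is a minimal sketch of that idea, not FunASR's internal implementation: transcribe_by_segments and the recognize callback are hypothetical names, and the 16 kHz rate is taken from the Model Details table.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE_HZ = 16_000  # sample rate listed under Model Details


def transcribe_by_segments(
    waveform: Sequence[float],
    segments_ms: Sequence[Sequence[int]],
    recognize: Callable[[Sequence[float]], object],
    sample_rate: int = SAMPLE_RATE_HZ,
) -> List[object]:
    """Slice the waveform at each [start_ms, end_ms] pair and run recognize on each slice."""
    results = []
    for start_ms, end_ms in segments_ms:
        # Convert millisecond boundaries to sample indices.
        start = start_ms * sample_rate // 1000
        end = min(end_ms * sample_rate // 1000, len(waveform))
        results.append(recognize(waveform[start:end]))
    return results


# Hypothetical stand-ins: 3 seconds of silence and two VAD segments.
waveform = [0.0] * (3 * SAMPLE_RATE_HZ)
segments_ms = [[0, 500], [1000, 2500]]

# Using len as the "recognizer" just reports how many samples each slice holds.
chunk_lengths = transcribe_by_segments(waveform, segments_ms, recognize=len)
print(chunk_lengths)
```

When vad_model is passed to AutoModel as shown above, this slicing is handled for you; the sketch is only meant to make the data flow visible.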

Features

Streaming and offline voice activity detection
Configurable maximum segment length (max_single_segment_time)
Low latency for real-time applications
Works with all FunASR ASR models as a preprocessing step
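In streaming mode, segment boundaries can arrive in different chunks, so a caller usually has to stitch them together. The sketch below assumes a convention in which each chunk yields a list of [start_ms, end_ms] pairs and -1 marks a boundary that has not been seen yet (for example [[120, -1]] for a start only). That convention is an assumption here and should be checked against the FunASR version you use; merge_streaming_events is a hypothetical helper, not part of the library.

```python
from typing import List, Optional, Sequence, Tuple


def merge_streaming_events(
    chunk_outputs: Sequence[Sequence[Sequence[int]]],
) -> Tuple[List[List[int]], Optional[int]]:
    """Combine per-chunk VAD events into closed segments.

    Returns the closed [start_ms, end_ms] segments and the start time of a
    segment that is still open at the end of the stream (or None).
    """
    segments: List[List[int]] = []
    open_start: Optional[int] = None
    for events in chunk_outputs:
        for start_ms, end_ms in events:
            if start_ms >= 0 and end_ms >= 0:
                # Both boundaries arrived in the same chunk.
                segments.append([start_ms, end_ms])
            elif start_ms >= 0:
                # Start only (assumed to be encoded as [start, -1]).
                open_start = start_ms
            elif end_ms >= 0 and open_start is not None:
                # End only (assumed to be encoded as [-1, end]).
                segments.append([open_start, end_ms])
                open_start = None
    return segments, open_start


# Hypothetical per-chunk outputs, one inner list per processed chunk.
chunk_outputs = [[], [[120, -1]], [], [[-1, 980]], [[1500, 2100]], [[2600, -1]]]
closed, pending_start = merge_streaming_events(chunk_outputs)
print(closed, pending_start)
```

A segment that is still open when the stream ends is returned separately so the caller can decide whether to close it at the final timestamp or discard it.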

Model Details

Property	Value
Architecture	FSMN (Feedforward Sequential Memory Network)
Sample Rate	16 kHz
Modes	Streaming + Offline
