FSMN-VAD

Voice Activity Detection: accurately detect speech segments in audio, essential for long-audio processing pipelines.

FSMN-VAD uses a Feedforward Sequential Memory Network to detect speech/non-speech boundaries with high precision and low latency. It supports both streaming and offline modes.

Quick Start

from funasr import AutoModel

# Standalone VAD
model = AutoModel(model="funasr/fsmn-vad", hub="hf", device="cuda")
result = model.generate(input="long_audio.wav")
# Returns speech segments: [[start_ms, end_ms], [start_ms, end_ms], ...]
print(result[0]["value"])
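
The segment list is often easier to work with in seconds, together with a total speech duration. The following is a minimal sketch, assuming the output has the [[start_ms, end_ms], ...] shape described in the comment above. The helper name summarize_segments and the example values are hypothetical and not part of FunASR.

```python
from typing import List, Tuple


def summarize_segments(
    segments: List[List[int]],
) -> Tuple[List[Tuple[float, float]], float]:
    """Convert [start_ms, end_ms] pairs to seconds and total the speech time."""
    # Convert each boundary from milliseconds to seconds.
    in_seconds = [(start / 1000.0, end / 1000.0) for start, end in segments]
    # Sum durations in integer milliseconds first to avoid float drift.
    total_ms = sum(end - start for start, end in segments)
    return in_seconds, total_ms / 1000.0


# Hypothetical segments shaped like result[0]["value"]; not real model output.
example_segments = [[0, 1500], [2300, 4000]]
spans, total_speech_s = summarize_segments(example_segments)
print(spans)
print(total_speech_s)
```

In practice, example_segments would be replaced by result[0]["value"] from the call above.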

Use as Part of ASR Pipeline

from funasr import AutoModel

# VAD automatically segments long audio before ASR
model = AutoModel(
    model="funasr/paraformer-zh",
    hub="hf",
    vad_model="funasr/fsmn-vad",
    device="cuda",
)
result = model.generate(input="meeting_2hours.wav")
print(result[0]["text"])

Features

  • Streaming and offline voice activity detection
  • Configurable maximum segment length (max_single_segment_time)
  • Low latency for real-time applications
  • Works with all FunASR ASR models as a preprocessing step
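
Streaming use means feeding audio to the detector in small pieces instead of as one file. The sketch below shows only the chunking step: it splits a 16 kHz sample buffer into fixed-length chunks and flags the last one. The 200 ms chunk size and the helper name iter_chunks are assumptions chosen for illustration. How chunks are passed to the model is left out because this card does not document the streaming call.

```python
from typing import Iterator, Sequence, Tuple


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_ms: int = 200,  # assumed chunk size, not a documented default
) -> Iterator[Tuple[Sequence[float], bool]]:
    """Yield (chunk, is_final) pairs covering the whole buffer in order."""
    stride = sample_rate * chunk_ms // 1000  # samples per chunk
    if stride <= 0:
        raise ValueError("chunk_ms and sample_rate must give at least 1 sample")
    # Ceiling division; an empty buffer still yields one (empty) final chunk.
    total = max(1, -(-len(samples) // stride))
    for i in range(total):
        chunk = samples[i * stride:(i + 1) * stride]
        yield chunk, i == total - 1


# Hypothetical input: 0.5 s of silence at 16 kHz.
silence = [0.0] * 8000
chunks = list(iter_chunks(silence))
for chunk, is_final in chunks:
    print(len(chunk), is_final)
```

The last chunk may be shorter than the others, so the consumer needs the is_final flag to know when to flush any open segment.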

Model Details

Property      Value
Architecture  FSMN (Feedforward Sequential Memory Network)
Sample Rate   16 kHz
Modes         Streaming + Offline
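
Because segment boundaries are reported in milliseconds and the model expects 16 kHz audio, cutting speech out of a waveform comes down to a unit conversion. This is a minimal sketch, assuming the waveform is already loaded as a flat sequence of samples at 16 kHz. The helpers ms_to_samples and slice_speech are hypothetical names, not FunASR functions.

```python
from typing import List, Sequence


def ms_to_samples(ms: int, sample_rate: int = 16000) -> int:
    """Convert a time offset in milliseconds to a sample index."""
    return ms * sample_rate // 1000


def slice_speech(
    samples: Sequence[float],
    segments: List[List[int]],
    sample_rate: int = 16000,
) -> List[Sequence[float]]:
    """Return one clip per [start_ms, end_ms] segment."""
    return [
        samples[ms_to_samples(start, sample_rate):ms_to_samples(end, sample_rate)]
        for start, end in segments
    ]


# Hypothetical data: 2 s of audio where each sample equals its own index.
waveform = list(range(32000))
clips = slice_speech(waveform, [[0, 500], [1000, 1250]])
print([len(clip) for clip in clips])
```

Audio recorded at another rate would need resampling to 16 kHz before detection, otherwise the millisecond boundaries would not line up with the samples.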
