Paper: Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition (arXiv:2305.05084)
CoreML export of nvidia/parakeet-tdt_ctc-110m for on-device speech recognition on Apple Silicon via FluidAudio.
| File | Size | Description |
|---|---|---|
| Preprocessor.mlmodelc | 207 MB | Fused mel-spectrogram + FastConformer encoder |
| Decoder.mlmodelc | 7.5 MB | 1-layer LSTM prediction network |
| JointDecision.mlmodelc | 2.7 MB | Single-step joint network (token + duration) |
| parakeet_vocab.json | 18 KB | 1024-token BPE vocabulary |
| config.json | 2.5 KB | Model metadata and I/O contracts |
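The split into an encoder, a prediction network, and a single-step joint that returns a token plus a duration suggests a greedy TDT decoding loop. The following is a minimal sketch of that loop, not FluidAudio's actual implementation. The `joint_step` callable, the `BLANK_ID` value (assumed here to be the index right after the 1024 BPE tokens), and the `max_symbols_per_frame` cap are all hypothetical; the real decoder also carries LSTM state, which this sketch reduces to the last emitted token.

```python
from typing import Callable, List, Tuple

# Assumption: the blank symbol sits right after the 1024 BPE tokens.
# The real value should be read from config.json.
BLANK_ID = 1024


def greedy_tdt_decode(
    num_frames: int,
    joint_step: Callable[[int, int], Tuple[int, int]],
    blank_id: int = BLANK_ID,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy TDT decoding over `num_frames` encoder frames.

    `joint_step(frame_index, last_token)` is a hypothetical stand-in for
    running Decoder + JointDecision once; it returns (token_id, duration).
    """
    tokens: List[int] = []
    t = 0
    last_token = blank_id
    emitted_on_frame = 0
    while t < num_frames:
        token, duration = joint_step(t, last_token)
        if token != blank_id:
            tokens.append(token)
            last_token = token
            emitted_on_frame += 1
        # Guarantee progress: a blank with duration 0, or too many symbols
        # on the same frame, forces a one-frame advance.
        if duration == 0 and (
            token == blank_id or emitted_on_frame >= max_symbols_per_frame
        ):
            duration = 1
        if duration > 0:
            emitted_on_frame = 0
        t += duration
    return tokens
```

The key difference from a plain RNN-T loop is that the predicted duration, rather than a fixed step of one, decides how many encoder frames to skip, which is where TDT saves joint-network calls.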
Input: 16 kHz mono audio, fixed 15-second window (240,000 samples). Output: Token IDs, probabilities, and TDT duration predictions per encoder frame.
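Because the window is fixed at 240,000 samples, shorter clips need padding and longer clips need splitting before they reach the model. The sketch below shows one simple way to do that, assuming zero-padding and non-overlapping windows; how FluidAudio actually chunks long audio is not described here, and the function name is hypothetical.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000                           # 16 kHz mono input
WINDOW_SECONDS = 15                            # fixed model window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 240,000 samples


def split_into_windows(samples: Sequence[float]) -> List[List[float]]:
    """Split audio into fixed 240,000-sample windows.

    Assumption: windows do not overlap and the last one is zero-padded.
    Empty input yields a single all-zero (silent) window.
    """
    windows: List[List[float]] = []
    for start in range(0, max(len(samples), 1), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        chunk.extend([0.0] * (WINDOW_SAMPLES - len(chunk)))
        windows.append(chunk)
    return windows
```

For example, a 20-second clip (320,000 samples) becomes two windows: one full window and one holding 80,000 real samples followed by 160,000 zeros.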
Benchmarked with FluidAudio CLI on Apple M2 (release build):
| Metric | Value |
|---|---|
| WER (LibriSpeech test-clean) | 3.0% |
| RTFx (overall) | 102x real-time |
| Peak memory | 0.3 GB |
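RTFx is the inverse real-time factor: seconds of audio processed per second of compute. As a rough illustration of what 102x means in practice, the helper below converts between the two quantities. The function names are hypothetical, and it assumes the "overall" figure is total audio duration divided by total processing time.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / processing time."""
    return audio_seconds / processing_seconds


def expected_processing_seconds(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated wall-clock time to transcribe audio at a given RTFx."""
    return audio_seconds / rtfx_value


# One 15-second window at the reported 102x: about 0.15 s of compute.
per_window = expected_processing_seconds(15.0, 102.0)
# One hour of audio at the same rate: roughly 35 s.
per_hour = expected_processing_seconds(3600.0, 102.0)
```

Actual latency on a single short clip can differ from this average, since model load and warm-up time are amortized differently in a benchmark run.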
NVIDIA's reference WER (greedy, GPU):
| Benchmark | WER |
|---|---|
| LibriSpeech test-clean | 2.4% |
| LibriSpeech test-other | 5.2% |
| AMI | 15.88% |
| Earnings-22 | 12.42% |
| GigaSpeech | 10.52% |
| TEDLIUM-v3 | 4.16% |
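For readers comparing the two tables, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal version of that calculation; it splits on whitespace only, whereas real benchmarks usually normalize text first (casing, punctuation), and the normalization behind the figures above is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion over six reference words, so the function returns 2/6.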
```bash
# Transcribe
fluidaudiocli transcribe audio.wav --model-version tdt-ctc-110m

# Benchmark
fluidaudiocli asr-benchmark --subset test-clean --model-version tdt-ctc-110m
```
Models auto-download from this repo on first use. To pre-fetch:
```bash
fluidaudiocli download --model-version tdt-ctc-110m
```
Exported from NeMo using `mobius/models/stt/parakeet-tdt-ctc-110m/coreml/convert-tdt-coreml.py`.
Base model: nvidia/parakeet-tdt_ctc-110m