Parakeet-TDT-CTC 110M — CoreML

CoreML export of nvidia/parakeet-tdt_ctc-110m for on-device speech recognition on Apple Silicon via FluidAudio.

CoreML Components

File                     Size     Description
Preprocessor.mlmodelc    207 MB   Fused mel-spectrogram + FastConformer encoder
Decoder.mlmodelc         7.5 MB   1-layer LSTM prediction network
JointDecision.mlmodelc   2.7 MB   Single-step joint network (token + duration)
parakeet_vocab.json      18 KB    1024-token BPE vocabulary
config.json              2.5 KB   Model metadata and I/O contracts

Input: 16 kHz mono audio, fixed 15-second window (240,000 samples). Output: Token IDs, probabilities, and TDT duration predictions per encoder frame.
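Because the window is fixed at 240,000 samples, shorter clips need padding and longer recordings need to be split before they reach the Preprocessor. The following is a minimal sketch of that bookkeeping. Zero-padding and non-overlapping windows are assumptions made for illustration, and `split_into_windows` is a hypothetical helper, not part of FluidAudio, which may handle long audio differently.

```python
SAMPLE_RATE = 16_000                            # Hz, from the model card
WINDOW_SECONDS = 15                             # fixed window, from the model card
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 240,000 samples


def split_into_windows(samples):
    """Split mono 16 kHz samples into fixed 240,000-sample windows.

    Hypothetical helper: the last window is zero-padded (an assumption).
    Returns (window, valid_length) pairs so a caller can ignore output
    that comes from the padded tail.
    """
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        valid = len(chunk)
        # Pad the tail with silence up to the fixed window length.
        chunk.extend([0.0] * (WINDOW_SAMPLES - valid))
        windows.append((chunk, valid))
    return windows


if __name__ == "__main__":
    twenty_seconds = [0.0] * (20 * SAMPLE_RATE)
    for window, valid in split_into_windows(twenty_seconds):
        print(len(window), valid)
```

Tracking the valid length alongside each window lets downstream decoding stop at the last frame that corresponds to real audio.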

Performance

Benchmarked with FluidAudio CLI on Apple M2 (release build):

Metric                         Value
WER (LibriSpeech test-clean)   3.0%
RTFx (overall)                 102x real-time
Peak memory                    0.3 GB
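To put the RTFx figure in concrete terms, the sketch below assumes the usual definition of RTFx as audio duration divided by processing time. The helper names are hypothetical, and the exact timing boundaries used by the FluidAudio benchmark are not stated here.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds, rtfx_value):
    """Expected wall-clock seconds for a given amount of audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    # At the reported 102x, one hour of audio takes roughly 35 seconds.
    print(round(processing_time(3600, 102), 1))  # 35.3
```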

NVIDIA's reference WER (greedy, GPU):

Benchmark                WER
LibriSpeech test-clean   2.4%
LibriSpeech test-other   5.2%
AMI                      15.88%
Earnings-22              12.42%
GigaSpeech               10.52%
TEDLIUM-v3               4.16%
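For readers comparing the two sets of numbers, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below shows the standard computation for a single utterance. It is not the scoring code used by either benchmark: real evaluations typically normalize text first and aggregate errors over the whole corpus, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c d"))  # 0.25
```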

Usage with FluidAudio

# Transcribe
fluidaudiocli transcribe audio.wav --model-version tdt-ctc-110m

# Benchmark
fluidaudiocli asr-benchmark --subset test-clean --model-version tdt-ctc-110m

Models auto-download from this repo on first use. To pre-fetch:

fluidaudiocli download --model-version tdt-ctc-110m

Conversion

Exported from NeMo using mobius/models/stt/parakeet-tdt-ctc-110m/coreml/convert-tdt-coreml.py:

  • Preprocessor fuses mel-spectrogram extraction and the FastConformer encoder into a single CoreML model
  • JointDecision is the single-step variant (encoder_step + decoder_step inputs) used by FluidAudio's TDT decoder
  • All models exported as MLProgram (iOS 17+ / macOS 14+), float32 precision
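The single-step joint is meant to be driven by a decoding loop that consumes one encoder frame and one decoder output at a time. The following is a minimal sketch of greedy TDT decoding under stated assumptions, not FluidAudio's implementation. The callables `init_decoder`, `advance_decoder`, and `joint_decision` are hypothetical stand-ins for the Decoder and JointDecision models, and the `max_symbols_per_frame` guard is an assumed safeguard against stalling on one frame.

```python
def tdt_greedy_decode(encoder_frames, init_decoder, advance_decoder,
                      joint_decision, blank_id, max_symbols_per_frame=10):
    """Greedy TDT decoding sketch (all callables are hypothetical).

    init_decoder() -> (decoder_step, state)
    advance_decoder(token, state) -> (decoder_step, state)
    joint_decision(encoder_step, decoder_step) -> (token_id, duration)
    """
    tokens = []
    decoder_step, state = init_decoder()
    t = 0
    emitted_here = 0  # tokens emitted since the frame index last advanced

    while t < len(encoder_frames):
        token, duration = joint_decision(encoder_frames[t], decoder_step)

        if token != blank_id:
            # A real token updates the prediction network; blank does not.
            tokens.append(token)
            decoder_step, state = advance_decoder(token, state)
            emitted_here += 1

        # Guarantee progress: a blank with duration 0, or too many tokens
        # on the same frame, advances by one frame.
        stalled = token == blank_id or emitted_here >= max_symbols_per_frame
        if duration == 0 and stalled:
            duration = 1

        if duration > 0:
            emitted_here = 0
        t += duration  # TDT skips ahead by the predicted duration

    return tokens
```

The duration output is what distinguishes this loop from plain transducer decoding: the frame index can jump several frames at once, so fewer joint calls are needed per window.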
