| license: cc-by-4.0 | |
| datasets: | |
| - mozilla-foundation/common_voice_17_0 | |
| - facebook/multilingual_librispeech | |
| - facebook/voxpopuli | |
| - datasets-CNRS/PFC | |
| - datasets-CNRS/CFPP | |
| - datasets-CNRS/CLAPI | |
| - gigant/african_accented_french | |
| - google/fleurs | |
| - datasets-CNRS/lesvocaux | |
| - datasets-CNRS/ACSYNT | |
| - medkit/simsamu | |
| language: | |
| - fr | |
| metrics: | |
| - wer | |
| base_model: | |
| - nvidia/stt_fr_fastconformer_hybrid_large_pc | |
| pipeline_tag: automatic-speech-recognition | |
| tags: | |
| - automatic-speech-recognition | |
| - speech | |
| - audio | |
| - Transducer | |
| - FastConformer | |
| - CTC | |
| - Transformer | |
| - pytorch | |
| - NeMo | |
| library_name: nemo | |
| model-index: | |
| - name: linto_stt_fr_fastconformer_pc | |
| results: | |
| - task: | |
| name: Automatic Speech Recognition | |
| type: automatic-speech-recognition | |
| dataset: | |
| name: common-voice-18-0 | |
| type: mozilla-foundation/common_voice_18_0 | |
| config: fr | |
| split: test | |
| args: | |
| language: fr | |
| metrics: | |
| - name: Test WER | |
| type: wer | |
| value: 7.88 | |
| - task: | |
| name: Automatic Speech Recognition | |
| type: automatic-speech-recognition | |
| dataset: | |
| name: Multilingual LibriSpeech | |
| type: facebook/multilingual_librispeech | |
| config: french | |
| split: test | |
| args: | |
| language: fr | |
| metrics: | |
| - name: Test WER | |
| type: wer | |
| value: 4.57 | |
| - task: | |
| name: Automatic Speech Recognition | |
| type: automatic-speech-recognition | |
| dataset: | |
| name: Vox Populi | |
| type: facebook/voxpopuli | |
| config: french | |
| split: test | |
| args: | |
| language: fr | |
| metrics: | |
| - name: Test WER | |
| type: wer | |
| value: 10.14 | |
| - task: | |
| name: Automatic Speech Recognition | |
| type: automatic-speech-recognition | |
| dataset: | |
| name: SUMM-RE | |
| type: linagora/SUMM-RE | |
| config: french | |
| split: test | |
| args: | |
| language: fr | |
| metrics: | |
| - name: Test WER | |
| type: wer | |
| value: 19.8 | |
| # LinTO STT French Punctuated – FastConformer | |
| <style> | |
| img { | |
| display: inline; | |
| } | |
| </style> | |
| --- | |
| ## Overview | |
| This model is a fine-tuned version of the [NVIDIA French FastConformer Hybrid Large model](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc). | |
| It is a large (115M parameters) hybrid ASR model trained with both **Transducer (default)** and **CTC** losses. | |
| Compared to the base model, this version was trained on **10,000+ hours** of diverse, manually transcribed French speech with punctuation. | |
| --- | |
| ## Performance | |
| The evaluation code is available in the [ASR Benchmark repository](https://github.com/linagora-labs/asr_benchmark). | |
| ### Word Error Rate (WER) | |
| WER was computed **without punctuation or uppercase letters** and datasets were cleaned. | |
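The normalization and scoring described above can be sketched as follows. This is a minimal illustration, not the code from the ASR Benchmark repository: the helper names `normalize` and `wer` and the exact cleaning rules (which punctuation is dropped, how apostrophes and hyphens are handled) are assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation (assumed rules, not the benchmark's exact ones)."""
    text = text.lower().replace("’", "'")
    # Replace everything except word characters, whitespace, apostrophes and hyphens
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Drop apostrophes/hyphens left dangling at token edges, then discard empty tokens
    tokens = [token.strip("'-") for token in text.split()]
    return " ".join(token for token in tokens if token)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # Levenshtein distance over words, computed one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(wer("Bonjour, comment allez-vous ?", "bonjour comment allez-vous"))
print(wer("Le chat dort.", "le chien dort"))
```

With this normalization, differences in casing and punctuation do not count as errors, so a model that outputs punctuated text is scored only on the words it produces.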
| The [SUMM-RE dataset](https://huggingface.co/datasets/linagora/SUMM-RE) is the only one used **exclusively for evaluation**, meaning neither model saw it during training. | |
| Evaluations can take a long time (especially for Whisper), so we kept only segments longer than 1 second and used a subset of the test split for most datasets: | |
| - 15% of CommonVoice: 2424 rows (3.9h) | |
| - 33% of Multilingual LibriSpeech: 800 rows (3.3h) | |
| - 33% of SUMM-RE: 1004 rows (2h). We selected only segments above 4 seconds to ensure quality. | |
| - 33% of VoxPopuli: 678 rows (1.6h) | |
| - Multilingual TEDx: 972 rows (1.5h) | |
| - 50% of our internal YouTube corpus: 956 rows (1h) | |
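A subset of this kind could be built roughly as shown below. This is a sketch under assumptions: the entries follow the NeMo manifest convention of one dict per utterance with a `duration` field in seconds, the function name `select_subset` is hypothetical, and the source does not state whether the subsets were drawn at random or taken in order.

```python
import random


def select_subset(entries, min_duration, fraction, seed=0):
    """Keep utterances longer than min_duration, then sample a fraction of them."""
    kept = [entry for entry in entries if entry["duration"] > min_duration]
    sample_size = round(len(kept) * fraction)
    # A fixed seed makes the evaluation subset reproducible across runs
    return random.Random(seed).sample(kept, sample_size)


# Toy manifest entries (durations in seconds); real entries would also carry
# "audio_filepath" and "text" fields.
entries = [{"duration": d} for d in (0.5, 1.5, 2.0, 3.0, 0.9, 4.2)]
subset = select_subset(entries, min_duration=1.0, fraction=0.5)
print(len(subset), sum(e["duration"] for e in subset))
```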
|  | |
| As shown in the table above (lower is better), the model performs robustly across all datasets, with results consistently close to the best, even on an out-of-domain dataset such as SUMM-RE. | |
| --- | |
| ## Usage | |
| This model can be used with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) for both inference and fine-tuning. | |
| ```python | |
| # Install nemo | |
| # !pip install nemo_toolkit['all'] | |
| import nemo.collections.asr as nemo_asr | |
| model_name = "linagora/linto_stt_fr_fastconformer_pc" | |
| asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name) | |
| # Path to your 16kHz mono-channel audio file | |
| audio_path = "/path/to/your/audio/file" | |
| # Transcribe with the default Transducer decoder | |
| asr_model.transcribe([audio_path]) | |
| # (Optional) Switch to CTC decoder | |
| asr_model.change_decoding_strategy(decoder_type="ctc") | |
| # (Optional) Transcribe with CTC decoder | |
| asr_model.transcribe([audio_path]) | |
| ``` | |
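Since the example above expects a 16 kHz mono audio file, it can help to check a file before transcribing it. The sketch below uses only the Python standard library; the helper name `check_wav_format` is hypothetical, and the `wave` module reads uncompressed PCM WAV files only, so other formats would need a dedicated audio library.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16_000) -> bool:
    """Return True if the WAV file is mono and sampled at expected_rate Hz."""
    with wave.open(path, "rb") as wav_file:
        is_mono = wav_file.getnchannels() == 1
        has_expected_rate = wav_file.getframerate() == expected_rate
    return is_mono and has_expected_rate
```

A file that fails this check would need to be resampled or downmixed before being passed to `transcribe`.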
| It can also be used with the [LinTO STT API](https://github.com/linto-ai/linto-stt/tree/master/nemo), | |
| an Automatic Speech Recognition (ASR) API that can function either as a standalone transcription service | |
| or be deployed within a microservices infrastructure using a message broker connector. It supports both offline and real-time transcription. | |
| ## Training Details | |
| The training code is available in the [nemo_asr_training repository](https://github.com/linagora-labs/nemo_asr_training). | |
| The full configuration used for fine-tuning is available [here](https://github.com/linagora-labs/nemo_asr_training/blob/main/fastconformer/yamls/finetuning_linto_stt_fr_fastconformer_pc.yaml). | |
| ### Hardware | |
| - 1× NVIDIA H100 GPU (80 GB) | |
| ### Training Configuration | |
| - Precision: BF16 mixed precision | |
| - Max training steps: 100,000 | |
| - Gradient accumulation: 4 batches | |
| ### Tokenizer | |
| - Type: SentencePiece | |
| - Vocabulary size: 1,024 tokens | |
| ### Optimization | |
| - Optimizer: `AdamW` | |
| - Learning rate: `1e-5` | |
| - Betas: `[0.9, 0.98]` | |
| - Weight decay: `1e-3` | |
| - Scheduler: `CosineAnnealing` | |
| - Warmup steps: 10,000 | |
| - Minimum learning rate: `1e-6` | |
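To make the schedule concrete, the sketch below computes the learning rate at a given step from the values listed above (peak `1e-5`, minimum `1e-6`, 10,000 warmup steps, 100,000 total steps). It assumes a linear warmup from zero followed by cosine decay over the remaining steps. NeMo's `CosineAnnealing` implementation may differ in small details such as step offsets, so treat this as an approximation rather than the scheduler's exact formula.

```python
import math

PEAK_LR = 1e-5
MIN_LR = 1e-6
WARMUP_STEPS = 10_000
MAX_STEPS = 100_000


def learning_rate(step: int) -> float:
    """Approximate learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate (assumed shape)
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak to the minimum over the remaining steps
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, learning_rate(step))
```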
| ### Data Setup | |
| - Audio segments ranging from 0.1 s to 30 s | |
| - Batch size of 50 | |
| - All datasets except YODAS and YouTubeFr were upsampled 2x | |
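The data setup above can be illustrated with a short sketch. It is not taken from the training repository: the function name `build_training_list`, the dict-based manifest entries, and the choice to implement upsampling by repeating entries are all assumptions. The effective batch size also assumes that the batch size of 50 is per device on the single GPU listed under Hardware.

```python
MIN_DURATION = 0.1   # seconds
MAX_DURATION = 30.0  # seconds
NOT_UPSAMPLED = {"YODAS", "YouTubeFr"}

# 50 utterances per batch x 4 accumulated batches x 1 GPU (assumed per-device batch size)
EFFECTIVE_BATCH_SIZE = 50 * 4 * 1


def build_training_list(datasets):
    """Filter utterances by duration and repeat the smaller datasets twice."""
    training_list = []
    for name, entries in datasets.items():
        kept = [e for e in entries if MIN_DURATION <= e["duration"] <= MAX_DURATION]
        repeats = 1 if name in NOT_UPSAMPLED else 2
        training_list.extend(kept * repeats)
    return training_list


datasets = {
    "YODAS": [{"duration": 5.0}, {"duration": 40.0}],
    "CommonVoice": [{"duration": 3.0}, {"duration": 0.05}],
}
print(len(build_training_list(datasets)), EFFECTIVE_BATCH_SIZE)
```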
| ### Training datasets | |
| The data were transformed, processed and converted using [NeMo tools from the SSAK repository](https://github.com/linagora-labs/ssak/tree/main/tools/nemo). YODAS segments were merged into segments lasting up to 30 seconds in order to improve the model's capabilities on longer segments. | |
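One way to merge short segments into longer ones is a greedy pass over consecutive segments, sketched below. This is an illustration only and not the SSAK implementation: the `Segment` class, the `merge_segments` function, and the `max_gap` parameter (the maximum silence allowed between two segments being merged) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_segments(segments, max_duration=30.0, max_gap=1.0):
    """Greedily merge consecutive segments while the result stays within max_duration."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged:
            last = merged[-1]
            fits = seg.end - last.start <= max_duration
            is_close = seg.start - last.end <= max_gap
            if fits and is_close:
                # Extend the previous segment and concatenate the transcripts
                merged[-1] = Segment(last.start, seg.end, f"{last.text} {seg.text}")
                continue
        merged.append(seg)
    return merged


segments = [
    Segment(0.0, 10.0, "bonjour à tous"),
    Segment(10.5, 20.0, "merci d'être là"),
    Segment(20.2, 35.0, "commençons la réunion"),
]
for seg in merge_segments(segments):
    print(seg)
```

In this toy example the first two segments are merged into one 20-second segment, while the third stays separate because merging it would exceed the 30-second limit.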
| The model was trained on over 10,000 hours of French speech, covering: | |
| - Read and spontaneous speech | |
| - Conversations and meetings | |
| - Varied accents and audio conditions | |
|  | |
| Datasets Used (by size): | |
| - YouTubeFr: an internally curated corpus of CC0-licensed French-language videos sourced from YouTube. It will soon be available on the LeVoiceLab platform | |
| - [YODAS](https://huggingface.co/datasets/espnet/yodas): fr000 subset | |
| - [Multilingual LibriSpeech](https://www.openslr.org/94/): French subset | |
| - [CommonVoice](https://commonvoice.mozilla.org/fr/datasets): French subset | |
| - [ESLO](http://eslo.huma-num.fr/index.php) | |
| - [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli): French subset | |
| - [Multilingual TEDx](https://www.openslr.org/100/): French subset | |
| - [TCOF](https://www.cnrtl.fr/corpus/tcof/) | |
| - CTF-AR (Corpus de Conversations Téléphoniques en Français avec Accents Régionaux): will soon be available on the LeVoiceLab platform | |
| - [PFC](https://www.ortolang.fr/market/corpora/pfc) | |
| - [OFROM](https://ofrom.unine.ch/index.php?page=citations) | |
| - CTFNN1 (Corpus de Conversations Téléphoniques en Français impliquant des accents Non-Natifs): will soon be available on the LeVoiceLab platform | |
| - [CFPP2000](https://www.ortolang.fr/market/corpora/cfpp2000) | |
| - [VOXFORGE](https://www.voxforge.org/) | |
| - [CLAPI](http://clapi.ish-lyon.cnrs.fr/) | |
| - [AfricanAccentedFrench](https://www.openslr.org/57/) | |
| - [FLEURS](https://huggingface.co/datasets/google/fleurs): French subset | |
| - [LesVocaux](https://www.ortolang.fr/market/corpora/lesvocaux/v0.0.1) | |
| - LINAGORA_Meetings | |
| - [CFPB](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html) | |
| - [ACSYNT](https://www.ortolang.fr/market/corpora/sldr000832) | |
| - [PxSLU](https://arxiv.org/abs/2207.08292) | |
| - [SimSamu](https://huggingface.co/datasets/medkit/simsamu) | |
| ## Limitations | |
| - May struggle with rare vocabulary, heavy accents, or overlapping/multi-speaker audio. | |
| - May struggle with punctuation on long segments. | |
| ## References | |
| [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084) | |
| [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece) | |
| [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) | |
| ## Acknowledgements | |
| Training of linto_stt_fr_fastconformer_pc was made possible by AI computing and storage resources provided by GENCI at IDRIS through grant 2025-A0181016189 on the H100 partition of the Jean Zay supercomputer. | |
| Thanks to NVIDIA for providing the base model architecture and the NeMo framework. | |
| ## Licence | |
| The model is released under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license, in line with the licensing of the original model it was fine-tuned from. |