Speech Emotion Recognition (6-class)

PyTorch CRNN with attention for speech emotion recognition from log-mel spectrograms. Predicts one of six emotions: Anger, Happy, Neutral, Sad, Fear, Surprise. Trained on 8 datasets in 3 languages (English, Polish, German), including acted speech and singing, with distance-weighted loss and data augmentation.

| Property | Value |
|---|---|
| Architecture | 3× CNN blocks → BiLSTM (2 layers, 128 hidden) → Attention → FC |
| Input | Log-mel spectrogram, 96 mels × 172 time frames (~4 s @ 22.05 kHz) |
| Output | Logits over 6 emotions |
| 6-class performance | 76.76% validation accuracy, 82.80% macro F1 |
| Inference | ~10–20 ms per sample (real-time capable) |

Trained on CREMA-D, RAVDESS (speech + songs), SAVEE, TESS, IEMOCAP, nEMO, EmoDB (~25.8k samples).
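The model outputs raw logits over the six classes; a softmax turns them into per-emotion probabilities. A minimal NumPy sketch (the label order below follows the class list in the summary and should be verified against `config.json`; `logits_to_label` is an illustrative helper, not part of the project API):

```python
import numpy as np

# Class names as listed above; verify the actual index order in config.json.
EMOTIONS = ["Anger", "Happy", "Neutral", "Sad", "Fear", "Surprise"]

def logits_to_label(logits):
    """Convert a length-6 logit vector to (predicted label, probabilities)."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax
    return EMOTIONS[int(probs.argmax())], probs

label, probs = logits_to_label([2.1, 0.3, -0.5, 0.0, -1.2, 0.4])
```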


Quick start

Clone the repo and run inference (recommended):

```bash
git clone https://github.com/willchristophersander/SpeechEmotionCS3540
cd SpeechEmotionCS3540
pip install torch librosa numpy huggingface_hub noisereduce
python scripts/huggingface/load_and_run_from_hub.py --repo-id williamsander/speech-emotion-crnn-6class --audio your_audio.wav
```

If you are using a different copy of this model, replace `williamsander/speech-emotion-crnn-6class` with that repo id (e.g. `username/speech-emotion-crnn-6class`).


Use in your own code

Install torch, librosa, numpy, and huggingface_hub. You also need the model class from the project repo (fly-app/ser/models/crnn_6class.py). Then:

```python
from huggingface_hub import hf_hub_download
import torch, json

repo_id = "williamsander/speech-emotion-crnn-6class"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

with open(config_path) as f:
    config = json.load(f)

# Use CRNN_6Class from the project repo (see link above)
from crnn_6class import CRNN_6Class

model = CRNN_6Class(n_mels=config["n_mels"], dropout=config["dropout"])

# The checkpoint may be a raw state dict or a training checkpoint that
# nests it under "model_state_dict"/"model_state"; handle all three cases.
state = torch.load(weights_path, map_location="cpu")
model.load_state_dict(state.get("model_state_dict") or state.get("model_state") or state)
model.eval()
```

Preprocessing (same as training): resample to 22.05 kHz, compute 96-bin log-mel spectrogram with hop_length=512, n_fft=2048, normalize to [-1, 1], then pad/trim to 172 time frames. See config.json for exact parameters and the repo for a full pipeline.
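The pad/trim and normalization steps can be sketched in NumPy. This is a minimal illustration, not the project's exact pipeline: `normalize` and `pad_or_trim` are hypothetical helper names, and min-max scaling to [-1, 1] is an assumption; compute the log-mel spectrogram itself with librosa (`melspectrogram` + `power_to_db`) using the parameters from `config.json`.

```python
import numpy as np

TARGET_FRAMES = 172  # ~4 s at 22.05 kHz with hop_length=512

# Upstream (not shown here): mel_db = librosa.power_to_db(
#     librosa.feature.melspectrogram(y=y, sr=22050, n_mels=96,
#                                    n_fft=2048, hop_length=512))

def normalize(mel_db):
    """Min-max scale a log-mel spectrogram (dB) to [-1, 1] (assumed scheme)."""
    lo, hi = mel_db.min(), mel_db.max()
    if hi == lo:
        return np.zeros_like(mel_db)
    return 2.0 * (mel_db - lo) / (hi - lo) - 1.0

def pad_or_trim(mel, target=TARGET_FRAMES):
    """Trim to `target` frames, or pad with -1 (the post-normalization floor)."""
    n_mels, frames = mel.shape
    if frames >= target:
        return mel[:, :target]
    pad = np.full((n_mels, target - frames), -1.0, dtype=mel.dtype)
    return np.concatenate([mel, pad], axis=1)
```

The resulting (96, 172) array, converted with `torch.from_numpy(...).float()` and given batch/channel dimensions as the model expects, is the input to the loaded model.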



Report

Full report (PDF)

Citation

Part of the SpeechEmotionCS3540 open-source project.

License

MIT
