# Speech Emotion Recognition (6-class)
PyTorch CRNN with attention for speech emotion recognition from log-mel spectrograms. Predicts one of six classes: Anger, Happy, Neutral, Sad, Fear, or Surprise. Trained on 8 datasets across 3 languages (English, Polish, German), including acted speech and singing, with a distance-weighted loss and data augmentation.
| Property | Details |
|---|---|
| Architecture | 3× CNN blocks → BiLSTM (2 layers, 128 hidden) → Attention → FC |
| Input | Log-mel spectrogram, 96 mels × 172 time frames (~4 s @ 22.05 kHz) |
| Output | Logits over 6 emotions |
| 6-class performance | 76.76% validation accuracy, 82.80% macro F1 |
| Inference | ~10–20 ms per sample (real-time capable) |
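The architecture row above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's actual `CRNN_6Class` class: the conv channel widths, pooling sizes, dropout placement, and the exact attention form are assumptions; only the 3 conv blocks, the 2-layer 128-hidden BiLSTM, the attention pooling, and the 96×172 input come from the table.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Illustrative CRNN: 3 conv blocks -> BiLSTM -> attention pooling -> FC.
    Channel widths (32/64/128) and additive attention are assumptions."""

    def __init__(self, n_mels=96, n_classes=6, hidden=128, dropout=0.3):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (32, 64, 128):  # assumed channel widths
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
                nn.MaxPool2d(2),  # halves both the mel and time axes
            ]
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        feat = 128 * (n_mels // 8)  # mel axis shrunk by 2^3 via pooling
        self.lstm = nn.LSTM(feat, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=dropout)
        self.attn = nn.Linear(2 * hidden, 1)   # per-frame attention score
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        h = self.cnn(x)                        # (batch, 128, n_mels/8, time/8)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time/8, feat)
        h, _ = self.lstm(h)                    # (batch, time/8, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        ctx = (w * h).sum(dim=1)               # weighted sum -> utterance vector
        return self.fc(ctx)                    # logits over 6 emotions

logits = CRNNSketch()(torch.randn(2, 1, 96, 172))
print(logits.shape)  # torch.Size([2, 6])
```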
Trained on CREMA-D, RAVDESS (speech + songs), SAVEE, TESS, IEMOCAP, nEMO, EmoDB (~25.8k samples).
## Quick start
Clone the repo and run inference (recommended):
```bash
git clone https://github.com/willchristophersander/SpeechEmotionCS3540
cd SpeechEmotionCS3540
pip install torch librosa numpy huggingface_hub noisereduce
python scripts/huggingface/load_and_run_from_hub.py \
    --repo-id williamsander/speech-emotion-crnn-6class \
    --audio your_audio.wav
```
Replace `williamsander/speech-emotion-crnn-6class` with this model's repo id (e.g. `username/speech-emotion-crnn-6class`).
## Use in your own code
Install `torch`, `librosa`, `numpy`, and `huggingface_hub`. You also need the model class from the project repo (`fly-app/ser/models/crnn_6class.py`). Then:
```python
import json
import torch
from huggingface_hub import hf_hub_download

repo_id = "williamsander/speech-emotion-crnn-6class"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

with open(config_path) as f:
    config = json.load(f)

# Use CRNN_6Class from the project repo (fly-app/ser/models/crnn_6class.py)
from crnn_6class import CRNN_6Class

model = CRNN_6Class(n_mels=config["n_mels"], dropout=config["dropout"])
state = torch.load(weights_path, map_location="cpu")
# Checkpoints may nest the weights under a key; fall back to the raw dict.
model.load_state_dict(state.get("model_state_dict") or state.get("model_state") or state)
model.eval()
```
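Once the model is loaded, mapping logits to a label is a softmax and an argmax. A minimal sketch follows; the label order in `LABELS` is an assumption here, and the authoritative order should be taken from `config.json`:

```python
import torch

# Assumed label order; check config.json for the authoritative ordering.
LABELS = ["Anger", "Happy", "Neutral", "Sad", "Fear", "Surprise"]

def predict(model, logmel):
    """logmel: array of shape (96, 172) -> (label, probability)."""
    x = torch.as_tensor(logmel).float().unsqueeze(0).unsqueeze(0)  # (1, 1, 96, 172)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1).squeeze(0)  # (6,)
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])
```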
Preprocessing (same as training): resample to 22.05 kHz, compute 96-bin log-mel spectrogram with hop_length=512, n_fft=2048, normalize to [-1, 1], then pad/trim to 172 time frames. See config.json for exact parameters and the repo for a full pipeline.
## Report

## Citation

Part of the SpeechEmotionCS3540 open-source project.

## License

MIT