Abstract
Woosh is a sound effect foundation model featuring audio encoding/decoding, text-audio alignment, and text-to-audio/video-to-audio generation capabilities with distilled versions for efficient deployment.
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models (2026)
- MOVA: Towards Scalable and Synchronized Video-Audio Generation (2026)
- Diffusion Models for Joint Audio-Video Generation (2026)
- UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation (2026)
- DashengTokenizer: One layer is enough for unified audio understanding and generation (2026)
- SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents (2026)
- TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (2026)