Higgs Audio v3 TTS - API-First High-Performance Inference Engine

This repository provides a production-grade, API-first deployment of the Higgs Audio v3 (4B) Text-to-Speech and Voice Cloning model, optimized for NVIDIA Blackwell (B200) and Hopper (H100) architectures.

The system is decoupled into two primary layers:

Core Production API: An OpenAI-compatible HTTP REST API served by SGLang-Omni on port 8000.
Interactive UI: A lightweight Gradio web client on port 7860 that acts as a pure frontend interface.
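
Because the core API is described as OpenAI-compatible, a client can in principle call it without going through the Gradio UI. The sketch below is a minimal illustration, not the repository's client code. It assumes the server exposes a speech route shaped like OpenAI's `/v1/audio/speech` (fields `model`, `input`, `voice`, `response_format`); the route, the field names, and the `"default"` voice are assumptions to verify against the running server.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # port 8000 is taken from this README


def build_speech_request(text: str, voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis.

    ASSUMPTION: the path and field names mirror OpenAI's /v1/audio/speech
    API; confirm them against the running server before relying on this.
    """
    payload = {
        "model": "bosonai/higgs-audio-v3-tts-4b",
        "input": text,
        "voice": voice,            # hypothetical voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{API_BASE}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "out.wav") -> None:
    """Send the request and save the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text), timeout=120) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

With the server running, `synthesize("Hello from Higgs Audio.")` would write the response body to `out.wav`, provided the assumed route matches.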

Inference Optimization Suite

To reach sub-second generation latency and high throughput, we replaced standard eager-mode execution with an optimized serving pipeline.

1. SGLang-Omni Serving Backend

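By hosting the causal LM backbone inside SGLang-Omni, we pipeline text preprocessing, audio tokenization, autoregressive decoding, and vocoder generation, so each stage can hand its output to the next without waiting for the whole request to finish. This is intended to avoid the Python GIL bottleneck between stages.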
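
The idea of staged pipelining can be illustrated with a toy example. This is only a structural sketch using `asyncio` queues, not how SGLang-Omni is implemented: the four stage functions are placeholders named after the stages listed above, and because they are plain synchronous functions, the example shows the hand-off pattern rather than real parallel speedup.

```python
import asyncio


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = await inbox.get()
        if item is None:              # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        await outbox.put(fn(item))


async def run_pipeline(requests, fns):
    """Chain the stage functions with queues and collect the final outputs."""
    queues = [asyncio.Queue() for _ in range(len(fns) + 1)]
    workers = [
        asyncio.create_task(stage(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for r in requests:
        await queues[0].put(r)
    await queues[0].put(None)
    results = []
    while (out := await queues[-1].get()) is not None:
        results.append(out)
    await asyncio.gather(*workers)
    return results


# Placeholder stages (hypothetical stand-ins, not the real model components).
STAGES = [
    lambda text: text.strip().lower(),           # text preprocessing
    lambda text: [ord(c) for c in text],         # tokenization stand-in
    lambda toks: [t + 1 for t in toks],          # decoding stand-in
    lambda toks: bytes(t % 256 for t in toks),   # vocoder stand-in
]
```

Calling `asyncio.run(run_pipeline(["first", "second"], STAGES))` pushes both requests through all four stages in order. In a real server each stage would await GPU work, which is where overlapping stages pays off.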

2. Advanced Attention Accelerators

FlashAttention-3 (via FlashInfer): Uses attention kernels tuned for Blackwell and Hopper GPUs to accelerate long-context sequences.
SageAttention: Adds quantized attention kernels to improve decode speed during sequence generation.
Paged Attention: Stores the key-value (KV) cache in fixed-size pages that are allocated on demand, which limits memory fragmentation and reduces the memory footprint.
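
To make the paging idea concrete, here is a toy allocator written under simplifying assumptions. It only tracks which physical page each token slot belongs to and holds no tensors. The class and method names are hypothetical and do not correspond to SGLang or FlashInfer APIs.

```python
class PagedKVAllocator:
    """Toy page allocator: each sequence owns a list of fixed-size pages."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # sequence id -> list of physical page ids
        self.lengths = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id: str):
        """Reserve a slot for one token; return (physical page, offset)."""
        n = self.lengths.get(seq_id, 0)
        pages = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:        # last page is full, or none yet
            if not self.free_pages:
                raise MemoryError("KV cache is out of pages")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1
        return pages[-1], n % self.page_size

    def release(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A sequence wastes at most one partially filled page, and pages released by a finished request can be reused by any other request, which is why paging limits fragmentation compared with reserving one contiguous maximum-length buffer per request.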

3. Precompiled JIT Caches

Runtime compilation of custom CUDA templates is bypassed at server startup. We use precompiled FlashInfer binaries built against CUDA 12.8, so optimized kernels are available as soon as the server boots.

4. CUDA Graph Batch Capture

Converts dynamic execution traces into static CUDA graphs during initialization. Running decode iterations inside pre-captured graphs replaces many individual kernel launches with a single graph launch, which cuts CPU launch overhead and keeps the GPU busier.
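
A captured CUDA graph has fixed tensor shapes, so serving engines commonly capture one graph per batch size and pad a live batch up to the nearest captured size. The sketch below shows only that selection logic in plain Python. The list of captured sizes is an illustrative assumption, not this deployment's actual configuration, and the function names are hypothetical.

```python
import bisect

# ASSUMPTION: an illustrative set of batch sizes captured at startup.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]


def pick_graph_batch_size(live_batch: int, captured=CAPTURED_BATCH_SIZES):
    """Return the smallest captured batch size >= live_batch, or None.

    None means no captured graph fits, so the step would fall back to
    ordinary (eager) kernel launches.
    """
    if live_batch < 1:
        raise ValueError("batch must contain at least one request")
    i = bisect.bisect_left(captured, live_batch)
    return captured[i] if i < len(captured) else None


def padding_needed(live_batch: int) -> int:
    """Number of dummy rows added so the batch matches a captured graph."""
    size = pick_graph_batch_size(live_batch)
    return 0 if size is None else size - live_batch
```

For example, a live batch of 5 requests would replay the graph captured for batch size 8 with 3 padded rows, trading a little wasted compute for the removal of per-kernel launch overhead.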

5. Whisper ASR for Voice Cloning

Includes an in-process Whisper speech-to-text model that transcribes reference voice samples in real time. Supplying an accurate transcript alongside the reference audio improves the zero-shot similarity of cloned voices.

Architecture & Communication Flow

[User Input] --> [Gradio Web UI (Port 7860)]
                        |
              (JSON HTTP POST Request)
                        v
         [SGLang-Omni Serve Core (Port 8000)]
         +----------------------------------+
         | - Whisper Reference ASR          |
         | - Autoregressive Causal LM       |
         | - FlashAttention-3 & SageAttn    |
         | - Unified bfloat16 Audio Vocoder |
         +----------------------------------+
                        |
               (Binary WAV Bytes)
                        v
[User Input] <-- [Gradio Web UI (Port 7860)]
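
Since the server returns binary WAV bytes, a client can sanity-check a response before playing it. The helper below is a small sketch using Python's standard `wave` module, which reads PCM WAV only; if the server emits another encoding, a different reader would be needed. `make_silence` merely fabricates a stand-in payload for local testing, and its 24000 Hz rate is an arbitrary choice, not the model's documented output rate.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Parse in-memory PCM WAV bytes and report basic properties."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def make_silence(seconds: float = 0.5, rate: int = 24000) -> bytes:
    """Generate a silent 16-bit mono WAV, standing in for a server response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

Passing the raw response body to `describe_wav` gives a quick check that the payload is well-formed audio rather than, say, a JSON error message.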

Running the Application

Step 1: Start the SGLang-Omni Server

/workspace/higgs-audio-v3-tts/.venv/bin/sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000
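
The Gradio client depends on the API server, so it can help to confirm that port 8000 is accepting connections before moving on. The following is a generic TCP readiness check, not part of this repository. An open port does not prove that the model has finished loading, so treat a successful result as a necessary condition only.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 8000,
                  timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)   # not listening yet; retry shortly
    return False
```

For instance, `wait_for_port()` returns `True` as soon as something is listening on port 8000 and `False` if nothing appears within five minutes.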

Step 2: Run the Gradio Client

/workspace/higgs-audio-v3-tts/.venv/bin/python app.py
