Spark-TTS Arabic (Fine-tuned on ClArTTS)

Fine-tuned version of SparkAudio/Spark-TTS-0.5B specialized for Arabic text-to-speech synthesis. The LLM component has been fine-tuned on the ClArTTS dataset (Classical Arabic Text-to-Speech corpus) containing 12 hours of high-quality single-speaker recordings.

📋 Model Description

Spark-TTS is a neural text-to-speech system that combines a language model (Qwen2) with a neural audio codec (BiCodec) for high-quality speech synthesis. This version has been specifically optimized for Arabic through fine-tuning on Classical Arabic speech data.

Architecture Components:

  • LLM (Qwen2): Fine-tuned for Arabic text-to-semantic token generation
  • BiCodec: Neural audio codec for semantic-to-audio token conversion (unchanged)
  • wav2vec2-large-xlsr-53: Speech encoder for voice cloning (unchanged)

What Changed: Only the LLM component was fine-tuned. The audio tokenizer and speech encoder remain identical to the base model, ensuring compatibility with the original Spark-TTS architecture.
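
As an illustration of the point above, the fine-tuned LLM can be loaded on its own with plain Transformers. This sketch assumes the repository follows the base Spark-TTS-0.5B layout (an LLM/ subfolder next to BiCodec/ and wav2vec2-large-xlsr-53/) and that the model has already been downloaded to ./arabic_model (see How to Use below); normal inference should still go through the Spark-TTS pipeline.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed layout: <model_dir>/LLM holds the fine-tuned Qwen2 checkpoint, while
# <model_dir>/BiCodec and <model_dir>/wav2vec2-large-xlsr-53 are unchanged
# copies of the base model's components.
llm = AutoModelForCausalLM.from_pretrained("./arabic_model/LLM")
tokenizer = AutoTokenizer.from_pretrained("./arabic_model/LLM")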

Key Features:

  • Voice cloning with 5-30 seconds of reference audio
  • Natural prosody and intonation for Classical Arabic
  • Single-speaker consistency
  • Controllable generation parameters

🎯 Intended Use

Direct Use

  • Arabic audiobook narration (Classical/MSA)
  • Voice-over for Arabic educational content
  • Accessibility tools for Arabic text
  • Voice cloning for Arabic speakers
  • Arabic language learning applications

Downstream Use

Can be further fine-tuned for:

  • Dialectal Arabic variants (Egyptian, Levantine, Gulf)
  • Domain-specific terminology (religious texts, literature)
  • Multi-speaker scenarios
  • Emotional or expressive speech

Out-of-Scope Use

Not recommended for:

  • Real-time speech synthesis (autoregressive generation is slower than real time)
  • Non-diacritized Arabic text (requires tashkeel)
  • Languages other than Arabic
  • Singing or non-speech audio generation

🚨 Very Important Note: This model requires the official Spark-TTS repository for inference. The model files alone are not sufficient; you must clone the Spark-TTS repo and use its inference pipeline.

🚀 How to Use

Installation

First, clone the official Spark-TTS repository (required for inference):

# Clone Spark-TTS
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS

# Install dependencies
pip install transformers soundfile huggingface_hub omegaconf torch

Download Model

from huggingface_hub import snapshot_download

# Download the fine-tuned model
model_dir = snapshot_download(
    repo_id="azeddinShr/Spark-TTS-Arabic-Complete",
    local_dir="./arabic_model"
)

Setup Inference Environment

import sys
import torch
import soundfile as sf

# Add the Spark-TTS cli directory to the import path
# (assumes the current working directory is the cloned Spark-TTS repo root)
sys.path.insert(0, './cli')

# Import SparkTTS class
from SparkTTS import SparkTTS

# Initialize device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Load Model

# Load the fine-tuned Arabic model
tts = SparkTTS("./arabic_model", device)
print("✅ Model loaded successfully!")

Basic Text-to-Speech

# Prepare input text (must include diacritics)
text = "مَرْحَبًا بِكُمْ فِي نَمُوذَجِ تَحْوِيلِ النَّصِّ إِلَى كَلَامٍ بِاللُّغَةِ الْعَرَبِيَّةِ."

# Reference audio and its transcript
reference_audio = "path/to/reference.wav"  # 5-30 seconds of clear Arabic speech
reference_text = "النَّصُّ الْمُطَابِقُ لِلصَّوْتِ الْمَرْجِعِيِّ"

# Generate speech
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text
)

# Save output
sf.write("output.wav", wav, samplerate=16000)
print("✅ Audio generated!")

Advanced Generation with Parameters

# Generate with custom parameters
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text,
    temperature=0.8,      # Controls randomness (0.1-1.5, default: 0.8)
    top_k=50,            # Top-k sampling (default: 50)
    top_p=0.95           # Nucleus sampling (default: 0.95)
)

sf.write("output_custom.wav", wav, samplerate=16000)
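
The same call can be looped over several diacritized sentences to narrate a longer passage with one cloned voice. The sentences below are only illustrative; any fully diacritized Arabic text works.

# Synthesize several sentences with the same reference voice
sentences = [
    "مَرْحَبًا بِكُمْ.",
    "هَذَا مِثَالٌ ثَانٍ.",
]

for i, sentence in enumerate(sentences):
    wav = tts.inference(
        sentence,
        prompt_speech_path=reference_audio,
        prompt_text=reference_text,
    )
    sf.write(f"output_{i:02d}.wav", wav, samplerate=16000)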

⚠️ Important Requirements

Input Text Requirements

  • Diacritization (Tashkeel) is REQUIRED
  • Text must include full Arabic diacritics (فَتْحَة، كَسْرَة، ضَمَّة، سُكُون، etc.)
  • Use AI tools (ChatGPT, Claude) or online diacritizers to add tashkeel (a quick programmatic check is sketched after the example below)

Example:

  • ❌ Bad: "مرحبا بكم في النموذج"
  • ✅ Good: "مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"
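
If you want to catch missing tashkeel before synthesis, a minimal check is sketched below. It only tests for the presence of Arabic diacritic code points (U+064B-U+0652); it does not verify that the diacritization is complete or correct.

import re

# Arabic diacritics (fathatan ... sukun): U+064B-U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def has_tashkeel(text: str) -> bool:
    """Rough check that the input contains at least some diacritics."""
    return bool(DIACRITICS.search(text))

print(has_tashkeel("مرحبا بكم في النموذج"))           # False - reject
print(has_tashkeel("مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"))  # True  - accept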

Reference Audio Requirements

  • Duration: 5-30 seconds of clear speech
  • Quality: Clean recording, minimal background noise
  • Speaker: Single speaker only
  • Language: Arabic (preferably MSA or Classical)
  • Format: WAV file recommended
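
The duration and channel guidelines above can be checked automatically with soundfile. This is a small helper sketch, not part of the Spark-TTS API; adjust the thresholds as needed.

import soundfile as sf

def check_reference(path, min_sec=5.0, max_sec=30.0):
    """Print basic stats and warn about clips outside the recommended range."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if not (min_sec <= duration <= max_sec):
        print(f"⚠️ Duration {duration:.1f}s is outside the recommended {min_sec}-{max_sec}s range")
    if info.channels != 1:
        print(f"⚠️ Expected mono audio, got {info.channels} channels")
    print(f"{path}: {duration:.1f}s, {info.samplerate} Hz, {info.channels} channel(s)")

check_reference("path/to/reference.wav")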

Reference Transcript Requirements

  • Must match reference audio exactly
  • Must include full diacritics
  • Text alignment is critical for quality

📊 Training Details

Training Data

Dataset: MBZUAI/ClArTTS

  • Full dataset size: 12 hours, 9,500 utterances
  • Training subset: 30% (~2,850 utterances)
  • Speaker: Single male speaker
  • Language: Classical Arabic
  • Sample rate: 40.1 kHz (resampled to 24 kHz for training)
  • Text quality: Fully diacritized

Training Procedure

Fine-tuning Framework: Axolotl (full fine-tuning of the LLM, no LoRA adapters)

Training Configuration:

Base Model: SparkAudio/Spark-TTS-0.5B (LLM component only)
Fine-tuning Method: Full fine-tuning (not LoRA)
Epochs: 20
Batch Size: 8 (1 per device × 8 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW (torch fused)
LR Scheduler: Cosine
Warmup Steps: 10
Sequence Length: 1024
Precision: bfloat16
Gradient Checkpointing: Enabled
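
The run itself used Axolotl, but the configuration above maps roughly onto standard Hugging Face TrainingArguments, sketched below for reference. Only the hyperparameters are shown; the dataset, tokenizer, and Trainer wiring for the text-to-semantic-token pairs are omitted, and the output directory name is illustrative.

from transformers import TrainingArguments

# Rough Transformers equivalent of the Axolotl configuration above (sketch only)
training_args = TrainingArguments(
    output_dir="./spark-tts-arabic-llm",
    num_train_epochs=20,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)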

Data Processing:

  1. Audio resampled to 24 kHz
  2. Semantic tokens extracted using BiCodec
  3. Training pairs: [text, semantic_tokens] created for LLM training
  4. Text normalized to lowercase during processing

Training Infrastructure:

  • Hardware: Single NVIDIA GPU (Colab)
  • Training Time: ~3-4 hours
  • Framework: PyTorch + Transformers + Axolotl

Data Preparation Steps:

# 1. Load ClArTTS from HuggingFace
# 2. Resample audio from 40.1 kHz → 24 kHz
# 3. Extract semantic tokens using BiCodec
# 4. Create metadata: [audio_path, text]
# 5. Generate training pairs: [text → semantic_tokens]
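
A hedged sketch of steps 1, 2, 4, and 5 is shown below using the datasets and librosa libraries. The repo id, split name, and column names ("audio", "text") are assumptions about how ClArTTS is hosted, and the semantic-token extraction in step 3 depends on the BiCodec tokenizer from the Spark-TTS repository, so it is left as a clearly marked placeholder rather than a verbatim API call.

import librosa
from datasets import load_dataset

def extract_semantic_tokens(wav):
    """Placeholder for step 3: in the real pipeline this calls the BiCodec
    tokenizer from the Spark-TTS repo (API not reproduced here)."""
    raise NotImplementedError

# 1. Load ClArTTS (assumed to be loadable via the datasets library)
ds = load_dataset("MBZUAI/ClArTTS", split="train")

pairs = []
for row in ds:
    audio = row["audio"]
    # 2. Resample from the original 40.1 kHz to the training sample rate
    wav = librosa.resample(audio["array"], orig_sr=audio["sampling_rate"], target_sr=24000)
    # 3. Extract semantic tokens (placeholder defined above)
    semantic_tokens = extract_semantic_tokens(wav)
    # 4./5. Pair the diacritized text with its semantic tokens for LLM training
    pairs.append({"text": row["text"], "semantic_tokens": semantic_tokens})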

📚 Citation

Base Model:

@misc{sparktts2024,
  title={Spark-TTS: Zero-Shot Multi-Style Text-to-Speech via Large Language Models},
  author={SparkAudio Team},
  year={2024},
  url={https://github.com/SparkAudio/Spark-TTS}
}

Training Dataset:

@inproceedings{kulkarni2023clartts,
  author={Ajinkya Kulkarni and Atharva Kulkarni and Sara Shatnawi and Hanan Aldarmaki},
  title={ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus},
  year={2023},
  booktitle={INTERSPEECH 2023},
  pages={5511--5515},
  doi={10.21437/Interspeech.2023-2224}
}

πŸ‘ Acknowledgments

  • Base Model: SparkAudio Team for Spark-TTS-0.5B
  • Dataset: MBZUAI for ClArTTS corpus
  • Frameworks: Hugging Face Transformers, Axolotl, PyTorch

📄 License

Apache 2.0 (same as base model)

📧 Contact

For questions, collaboration, or support:


