Spark-TTS Arabic (Fine-tuned on ClArTTS)

Fine-tuned version of SparkAudio/Spark-TTS-0.5B specialized for Arabic text-to-speech synthesis. The LLM component has been fine-tuned on the ClArTTS dataset (Classical Arabic Text-to-Speech corpus) containing 12 hours of high-quality single-speaker recordings.

📋 Model Description

Spark-TTS is a neural text-to-speech system that combines a language model (Qwen2) with a neural audio codec (BiCodec) for high-quality speech synthesis. This version has been specifically optimized for Arabic through fine-tuning on Classical Arabic speech data.

Architecture Components:

  • LLM (Qwen2): Fine-tuned for Arabic text-to-semantic token generation
  • BiCodec: Neural audio codec for semantic-to-audio token conversion (unchanged)
  • wav2vec2-large-xlsr-53: Speech encoder for voice cloning (unchanged)

What Changed: Only the LLM component was fine-tuned. The audio tokenizer and speech encoder remain identical to the base model, ensuring compatibility with the original Spark-TTS architecture.
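
As an illustration of the point above, the fine-tuned LLM can be loaded on its own with plain Transformers. This sketch assumes the repository follows the base Spark-TTS-0.5B layout (an LLM/ subfolder next to BiCodec/ and wav2vec2-large-xlsr-53/) and that the model has already been downloaded to ./arabic_model (see How to Use below); normal inference should still go through the Spark-TTS pipeline.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed layout: <model_dir>/LLM holds the fine-tuned Qwen2 checkpoint, while
# <model_dir>/BiCodec and <model_dir>/wav2vec2-large-xlsr-53 are unchanged
# copies of the base model's components.
llm = AutoModelForCausalLM.from_pretrained("./arabic_model/LLM")
tokenizer = AutoTokenizer.from_pretrained("./arabic_model/LLM")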

Key Features:

  • Voice cloning with 5-30 seconds of reference audio
  • Natural prosody and intonation for Classical Arabic
  • Single-speaker consistency
  • Controllable generation parameters

🎯 Intended Use

Direct Use

  • Arabic audiobook narration (Classical/MSA)
  • Voice-over for Arabic educational content
  • Accessibility tools for Arabic text
  • Voice cloning for Arabic speakers
  • Arabic language learning applications

Downstream Use

Can be further fine-tuned for:

  • Dialectal Arabic variants (Egyptian, Levantine, Gulf)
  • Domain-specific terminology (religious texts, literature)
  • Multi-speaker scenarios
  • Emotional or expressive speech

Out-of-Scope Use

Not recommended for:

  • Real-time speech synthesis (autoregressive generation is slower than real time)
  • Non-diacritized Arabic text (requires tashkeel)
  • Languages other than Arabic
  • Singing or non-speech audio generation

🚨 Very Important Note: This model requires the official Spark-TTS repository for inference. The model files alone are not sufficient; you must clone the Spark-TTS repo and use its inference pipeline.

🚀 How to Use

Installation

First, clone the official Spark-TTS repository (required for inference):

# Clone Spark-TTS
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS

# Install dependencies
pip install transformers soundfile huggingface_hub omegaconf torch

Download Model

from huggingface_hub import snapshot_download

# Download the fine-tuned model
model_dir = snapshot_download(
    repo_id="azeddinShr/Spark-TTS-Arabic-Complete",
    local_dir="./arabic_model"
)

Setup Inference Environment

import sys
import torch
import soundfile as sf

# Add the Spark-TTS cli directory to the import path
# (assumes the current working directory is the cloned Spark-TTS repo root)
sys.path.insert(0, './cli')

# Import SparkTTS class
from SparkTTS import SparkTTS

# Initialize device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Load Model

# Load the fine-tuned Arabic model
tts = SparkTTS("./arabic_model", device)
print("✅ Model loaded successfully!")

Basic Text-to-Speech

# Prepare input text (must include diacritics)
text = "مَرْحَبًا بِكُمْ فِي نَمُوذَجِ تَحْوِيلِ النَّصِّ إِلَى كَلَامٍ بِاللُّغَةِ الْعَرَبِيَّةِ."

# Reference audio and its transcript
reference_audio = "path/to/reference.wav"  # 5-30 seconds of clear Arabic speech
reference_text = "النَّصُّ الْمُطَابِقُ لِلصَّوْتِ الْمَرْجِعِيِّ"

# Generate speech
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text
)

# Save output
sf.write("output.wav", wav, samplerate=16000)
print("✅ Audio generated!")

Advanced Generation with Parameters

# Generate with custom parameters
wav = tts.inference(
    text,
    prompt_speech_path=reference_audio,
    prompt_text=reference_text,
    temperature=0.8,      # Controls randomness (0.1-1.5, default: 0.8)
    top_k=50,            # Top-k sampling (default: 50)
    top_p=0.95           # Nucleus sampling (default: 0.95)
)

sf.write("output_custom.wav", wav, samplerate=16000)
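
The same call can be looped over several diacritized sentences to narrate a longer passage with one cloned voice. The sentences below are only illustrative; any fully diacritized Arabic text works.

# Synthesize several sentences with the same reference voice
sentences = [
    "مَرْحَبًا بِكُمْ.",
    "هَذَا مِثَالٌ ثَانٍ.",
]

for i, sentence in enumerate(sentences):
    wav = tts.inference(
        sentence,
        prompt_speech_path=reference_audio,
        prompt_text=reference_text,
    )
    sf.write(f"output_{i:02d}.wav", wav, samplerate=16000)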

⚠️ Important Requirements

Input Text Requirements

  • Diacritization (Tashkeel) is REQUIRED
  • Text must include full Arabic diacritics (فَتْحَة، كَسْرَة، ضَمَّة، سُكُون، etc.)
  • Use AI tools (ChatGPT, Claude) or online diacritizers to add tashkeel (a quick programmatic check is sketched after the example below)

Example:

  • ❌ Bad: "مرحبا بكم في النموذج"
  • ✅ Good: "مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"
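
If you want to catch missing tashkeel before synthesis, a minimal check is sketched below. It only tests for the presence of Arabic diacritic code points (U+064B-U+0652); it does not verify that the diacritization is complete or correct.

import re

# Arabic diacritics (fathatan ... sukun): U+064B-U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def has_tashkeel(text: str) -> bool:
    """Rough check that the input contains at least some diacritics."""
    return bool(DIACRITICS.search(text))

print(has_tashkeel("مرحبا بكم في النموذج"))           # False - reject
print(has_tashkeel("مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"))  # True  - accept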

Reference Audio Requirements

  • Duration: 5-30 seconds of clear speech
  • Quality: Clean recording, minimal background noise
  • Speaker: Single speaker only
  • Language: Arabic (preferably MSA or Classical)
  • Format: WAV file recommended
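
The duration and channel guidelines above can be checked automatically with soundfile. This is a small helper sketch, not part of the Spark-TTS API; adjust the thresholds as needed.

import soundfile as sf

def check_reference(path, min_sec=5.0, max_sec=30.0):
    """Print basic stats and warn about clips outside the recommended range."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if not (min_sec <= duration <= max_sec):
        print(f"⚠️ Duration {duration:.1f}s is outside the recommended {min_sec}-{max_sec}s range")
    if info.channels != 1:
        print(f"⚠️ Expected mono audio, got {info.channels} channels")
    print(f"{path}: {duration:.1f}s, {info.samplerate} Hz, {info.channels} channel(s)")

check_reference("path/to/reference.wav")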

Reference Transcript Requirements

  • Must match reference audio exactly
  • Must include full diacritics
  • Text alignment is critical for quality

📊 Training Details

Training Data

Dataset: MBZUAI/ClArTTS

  • Full dataset size: 12 hours, 9,500 utterances
  • Training subset: 30% (~2,850 utterances)
  • Speaker: Single male speaker
  • Language: Classical Arabic
  • Sample rate: 40.1 kHz (resampled to 24 kHz for training)
  • Text quality: Fully diacritized

Training Procedure

Fine-tuning Framework: Axolotl (full fine-tuning of the LLM, no LoRA adapters)

Training Configuration:

Base Model: SparkAudio/Spark-TTS-0.5B (LLM component only)
Fine-tuning Method: Full fine-tuning (not LoRA)
Epochs: 20
Batch Size: 8 (1 per device × 8 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW (torch fused)
LR Scheduler: Cosine
Warmup Steps: 10
Sequence Length: 1024
Precision: bfloat16
Gradient Checkpointing: Enabled
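
The run itself used Axolotl, but the configuration above maps roughly onto standard Hugging Face TrainingArguments, sketched below for reference. Only the hyperparameters are shown; the dataset, tokenizer, and Trainer wiring for the text-to-semantic-token pairs are omitted, and the output directory name is illustrative.

from transformers import TrainingArguments

# Rough Transformers equivalent of the Axolotl configuration above (sketch only)
training_args = TrainingArguments(
    output_dir="./spark-tts-arabic-llm",
    num_train_epochs=20,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    optim="adamw_torch_fused",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)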

Data Processing:

  1. Audio resampled to 24 kHz
  2. Semantic tokens extracted using BiCodec
  3. Training pairs: [text, semantic_tokens] created for LLM training
  4. Text normalized to lowercase during processing

Training Infrastructure:

  • Hardware: Single NVIDIA GPU (Colab)
  • Training Time: ~3-4 hours
  • Framework: PyTorch + Transformers + Axolotl

Data Preparation Steps:

# 1. Load ClArTTS from HuggingFace
# 2. Resample audio from 40.1 kHz → 24 kHz
# 3. Extract semantic tokens using BiCodec
# 4. Create metadata: [audio_path, text]
# 5. Generate training pairs: [text → semantic_tokens]
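
A hedged sketch of steps 1, 2, 4, and 5 is shown below using the datasets and librosa libraries. The repo id, split name, and column names ("audio", "text") are assumptions about how ClArTTS is hosted, and the semantic-token extraction in step 3 depends on the BiCodec tokenizer from the Spark-TTS repository, so it is left as a clearly marked placeholder rather than a verbatim API call.

import librosa
from datasets import load_dataset

def extract_semantic_tokens(wav):
    """Placeholder for step 3: in the real pipeline this calls the BiCodec
    tokenizer from the Spark-TTS repo (API not reproduced here)."""
    raise NotImplementedError

# 1. Load ClArTTS (assumed to be loadable via the datasets library)
ds = load_dataset("MBZUAI/ClArTTS", split="train")

pairs = []
for row in ds:
    audio = row["audio"]
    # 2. Resample from the original 40.1 kHz to the training sample rate
    wav = librosa.resample(audio["array"], orig_sr=audio["sampling_rate"], target_sr=24000)
    # 3. Extract semantic tokens (placeholder defined above)
    semantic_tokens = extract_semantic_tokens(wav)
    # 4./5. Pair the diacritized text with its semantic tokens for LLM training
    pairs.append({"text": row["text"], "semantic_tokens": semantic_tokens})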

📚 Citation

Base Model:

@misc{sparktts2024,
  title={Spark-TTS: Zero-Shot Multi-Style Text-to-Speech via Large Language Models},
  author={SparkAudio Team},
  year={2024},
  url={https://github.com/SparkAudio/Spark-TTS}
}

Training Dataset:

@inproceedings{kulkarni2023clartts,
  author={Ajinkya Kulkarni and Atharva Kulkarni and Sara Shatnawi and Hanan Aldarmaki},
  title={ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus},
  year={2023},
  booktitle={INTERSPEECH 2023},
  pages={5511--5515},
  doi={10.21437/Interspeech.2023-2224}
}

πŸ‘ Acknowledgments

  • Base Model: SparkAudio Team for Spark-TTS-0.5B
  • Dataset: MBZUAI for ClArTTS corpus
  • Frameworks: Hugging Face Transformers, Axolotl, PyTorch

📄 License

Apache 2.0 (same as base model)

📧 Contact

For questions, collaboration, or support:


