Spark-TTS Arabic (Fine-tuned on ClArTTS)
A fine-tuned version of SparkAudio/Spark-TTS-0.5B specialized for Arabic text-to-speech synthesis. The LLM component has been fine-tuned on the ClArTTS dataset (Classical Arabic Text-to-Speech corpus), which contains about 12 hours of high-quality single-speaker recordings.
📋 Model Description
Spark-TTS is a neural text-to-speech system that combines a language model (Qwen2) with a neural audio codec (BiCodec) for high-quality speech synthesis. This version has been specifically optimized for Arabic through fine-tuning on Classical Arabic speech data.
Architecture Components:
- LLM (Qwen2): Fine-tuned for Arabic text-to-semantic token generation
- BiCodec: Neural audio codec for semantic-to-audio token conversion (unchanged)
- wav2vec2-large-xlsr-53: Speech encoder for voice cloning (unchanged)
What Changed: Only the LLM component was fine-tuned. The audio tokenizer and speech encoder remain identical to the base model, ensuring compatibility with the original Spark-TTS architecture.
Key Features:
- Voice cloning with 5-30 seconds of reference audio
- Natural prosody and intonation for Classical Arabic
- Single-speaker consistency
- Controllable generation parameters
🎯 Intended Use
Direct Use
- Arabic audiobook narration (Classical/MSA)
- Voice-over for Arabic educational content
- Accessibility tools for Arabic text
- Voice cloning for Arabic speakers
- Arabic language learning applications
Downstream Use
Can be further fine-tuned for:
- Dialectal Arabic variants (Egyptian, Levantine, Gulf)
- Domain-specific terminology (religious texts, literature)
- Multi-speaker scenarios
- Emotional or expressive speech
Out-of-Scope Use
Not recommended for:
- Real-time speech synthesis (model is relatively slow)
- Non-diacritized Arabic text (the model requires tashkeel)
- Languages other than Arabic
- Singing or non-speech audio generation
🚨 Very Important Note: This model requires the official Spark-TTS repository for inference. The model files alone are not sufficient; you must clone the Spark-TTS repo and use its inference pipeline.
🚀 How to Use
Installation
First, clone the official Spark-TTS repository (required for inference):
# Clone Spark-TTS
git clone https://github.com/SparkAudio/Spark-TTS
cd Spark-TTS
# Install dependencies
pip install transformers soundfile huggingface_hub omegaconf torch
Download Model
from huggingface_hub import snapshot_download
# Download the fine-tuned model
model_dir = snapshot_download(
repo_id="azeddinShr/Spark-TTS-Arabic-Complete",
local_dir="./arabic_model"
)
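Optionally, sanity-check the download before loading it. This is a minimal sketch: the LLM, BiCodec, and wav2vec2-large-xlsr-53 folder names are an assumption that the fine-tuned repo mirrors the base Spark-TTS-0.5B layout.
import os
# Folder names below assume the same layout as the base Spark-TTS-0.5B repo;
# adjust them if the downloaded repo is organized differently.
expected = ["LLM", "BiCodec", "wav2vec2-large-xlsr-53"]
missing = [name for name in expected if not os.path.isdir(os.path.join(model_dir, name))]
print("Missing components:", missing if missing else "none")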
Setup Inference Environment
import sys
import torch
import soundfile as sf
# Add the Spark-TTS cli directory to the Python path (run this from the cloned Spark-TTS folder)
sys.path.insert(0, './cli')
# Import SparkTTS class
from SparkTTS import SparkTTS
# Initialize device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Load Model
# Load the fine-tuned Arabic model
tts = SparkTTS("./arabic_model", device)
print("β
Model loaded successfully!")
Basic Text-to-Speech
# Prepare input text (must include diacritics)
text = "Ω
ΩΨ±ΩΨΩΨ¨ΩΨ§ Ψ¨ΩΩΩΩ
Ω ΩΩΩ ΩΩΩ
ΩΩΨ°ΩΨ¬Ω ΨͺΩΨΩΩΩΩΩΩ Ψ§ΩΩΩΩΨ΅ΩΩ Ψ₯ΩΩΩΩ ΩΩΩΩΨ§Ω
Ω Ψ¨ΩΨ§ΩΩΩΩΨΊΩΨ©Ω Ψ§ΩΩΨΉΩΨ±ΩΨ¨ΩΩΩΩΨ©Ω."
# Reference audio and its transcript
reference_audio = "path/to/reference.wav" # 5-30 seconds of clear Arabic speech
reference_text = "Ψ§ΩΩΩΩΨ΅ΩΩ Ψ§ΩΩΩ
ΩΨ·ΩΨ§Ψ¨ΩΩΩ ΩΩΩΨ΅ΩΩΩΩΨͺΩ Ψ§ΩΩΩ
ΩΨ±ΩΨ¬ΩΨΉΩΩΩΩ"
# Generate speech
wav = tts.inference(
text,
prompt_speech_path=reference_audio,
prompt_text=reference_text
)
# Save output
sf.write("output.wav", wav, samplerate=16000)
print("β
Audio generated!")
Advanced Generation with Parameters
# Generate with custom parameters
wav = tts.inference(
text,
prompt_speech_path=reference_audio,
prompt_text=reference_text,
temperature=0.8, # Controls randomness (0.1-1.5, default: 0.8)
top_k=50, # Top-k sampling (default: 50)
top_p=0.95 # Nucleus sampling (default: 0.95)
)
sf.write("output_custom.wav", wav, samplerate=16000)
⚠️ Important Requirements
Input Text Requirements
- Diacritization (Tashkeel) is REQUIRED
- Text must include full Arabic diacritics: fatha (فَتْحَة), kasra (كَسْرَة), damma (ضَمَّة), sukun (سُكُون), etc.
- Use AI tools (ChatGPT, Claude) or online diacritizers to add tashkeel; a simple pre-check sketch follows the example below
Example:
- ❌ Bad: "مرحبا بكم في النموذج"
- ✅ Good: "مَرْحَبًا بِكُمْ فِي النَّمُوذَجِ"
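Because output quality drops sharply on undiacritized input, it can help to check for tashkeel before calling the model. This is a minimal heuristic sketch based on the Unicode range for Arabic harakat (U+064B to U+0652); the 0.2 ratio threshold is illustrative, not something the model enforces.
def has_tashkeel(text: str, min_ratio: float = 0.2) -> bool:
    """Heuristic check: does the text carry enough Arabic diacritics?"""
    # Arabic harakat (fathatan .. sukun) live in U+064B..U+0652.
    diacritics = sum(1 for ch in text if "\u064B" <= ch <= "\u0652")
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    return letters > 0 and diacritics / letters >= min_ratio

if not has_tashkeel(text):
    print("Warning: input text appears to lack tashkeel; output quality may suffer.")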
Reference Audio Requirements
- Duration: 5-30 seconds of clear speech
- Quality: Clean recording, minimal background noise
- Speaker: Single speaker only
- Language: Arabic (preferably MSA or Classical)
- Format: WAV file recommended (see the validation sketch below)
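A quick way to verify a reference clip against these constraints before inference. This is a minimal sketch using soundfile; the thresholds simply mirror the guidance in the list above.
import soundfile as sf

def check_reference(path: str) -> None:
    # Inspect duration and channel count without loading the full waveform.
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if not 5 <= duration <= 30:
        print(f"Warning: reference is {duration:.1f}s; 5-30s of clear speech works best.")
    if info.channels != 1:
        print("Warning: reference is not mono; consider downmixing to a single channel.")

check_reference("path/to/reference.wav")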
Reference Transcript Requirements
- Must match reference audio exactly
- Must include full diacritics
- Text alignment is critical for quality
📊 Training Details
Training Data
Dataset: MBZUAI/ClArTTS (a loading sketch follows this list)
- Full dataset size: 12 hours, 9,500 utterances
- Training subset: 30% (~2,850 utterances)
- Speaker: Single male speaker
- Language: Classical Arabic (MSA)
- Sample rate: 40.1 kHz (resampled to 24 kHz for training)
- Text quality: Fully diacritized
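For orientation, a roughly equivalent way to build such a subset with the datasets library is sketched below; the "train" split name and the shuffled 30% selection are assumptions, not the author's exact preprocessing.
from datasets import load_dataset

# Load ClArTTS and keep roughly 30% of the utterances (assumed split name "train").
ds = load_dataset("MBZUAI/ClArTTS", split="train")
subset = ds.shuffle(seed=42).select(range(int(0.3 * len(ds))))
print(f"Using {len(subset)} of {len(ds)} utterances")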
Training Procedure
Fine-tuning Framework: Axolotl
Training Configuration (a rough TrainingArguments equivalent is sketched after this list):
Base Model: SparkAudio/Spark-TTS-0.5B (LLM component only)
Fine-tuning Method: Full fine-tuning (not LoRA)
Epochs: 20
Batch Size: 8 (1 per device × 8 gradient accumulation steps)
Learning Rate: 2e-4
Optimizer: AdamW (torch fused)
LR Scheduler: Cosine
Warmup Steps: 10
Sequence Length: 1024
Precision: bfloat16
Gradient Checkpointing: Enabled
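For readers more familiar with plain Transformers than Axolotl, the configuration above maps roughly to the following TrainingArguments. This is a sketch of equivalents, not the exact Axolotl config used for this model; the output path is illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./spark-tts-arabic-llm",  # illustrative output path
    num_train_epochs=20,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    warmup_steps=10,
    bf16=True,
    gradient_checkpointing=True,
)
# The 1024-token sequence length is applied at tokenization time, not here.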
Data Processing:
- Audio resampled to 24 kHz
- Semantic tokens extracted using BiCodec
- Training pairs [text, semantic_tokens] created for LLM training
- Text normalized to lowercase during processing
Training Infrastructure:
- Hardware: Single NVIDIA GPU (Colab)
- Training Time: ~3-4 hours
- Framework: PyTorch + Transformers + Axolotl
Data Preparation Steps (a resampling sketch follows these steps):
# 1. Load ClArTTS from HuggingFace
# 2. Resample audio from 40.1 kHz → 24 kHz
# 3. Extract semantic tokens using BiCodec
# 4. Create metadata: [audio_path, text]
# 5. Generate training pairs: [text → semantic_tokens]
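Step 2 (resampling) can be reproduced with librosa as in the minimal sketch below; the file paths are placeholders, and the BiCodec semantic-token extraction in step 3 uses the tokenizer shipped with the Spark-TTS repo and is not shown here.
import librosa
import soundfile as sf

# Resample one ClArTTS utterance from its native 40.1 kHz to 24 kHz.
audio, native_sr = librosa.load("clartts_utterance.wav", sr=None)  # placeholder path
audio_24k = librosa.resample(audio, orig_sr=native_sr, target_sr=24000)
sf.write("clartts_utterance_24k.wav", audio_24k, samplerate=24000)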
📚 Citation
Base Model:
@misc{sparktts2024,
title={Spark-TTS: Zero-Shot Multi-Style Text-to-Speech via Large Language Models},
author={SparkAudio Team},
year={2024},
url={https://github.com/SparkAudio/Spark-TTS}
}
Training Dataset:
@inproceedings{kulkarni2023clartts,
author={Ajinkya Kulkarni and Atharva Kulkarni and Sara Shatnawi and Hanan Aldarmaki},
title={ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus},
year={2023},
booktitle={INTERSPEECH 2023},
pages={5511--5515},
doi={10.21437/Interspeech.2023-2224}
}
🙏 Acknowledgments
- Base Model: SparkAudio Team for Spark-TTS-0.5B
- Dataset: MBZUAI for ClArTTS corpus
- Frameworks: Hugging Face Transformers, Axolotl, PyTorch
📄 License
Apache 2.0 (same as base model)
📧 Contact
For questions, collaboration, or support:
- Email: [email protected]
- Hugging Face: @azeddinShr
- Model Discussions: Use the Community tab above
Note: This model requires the official Spark-TTS repository for inference. The model files alone are not sufficient; you must clone the Spark-TTS repo and use its inference pipeline.