Zenyx V3 Base (1.5B Mixture-of-Experts)
Zenyx V3 is a 1.5B-parameter Mixture-of-Experts (MoE) foundation model designed for low-latency inference, high throughput, and extensible context length.
Current Training Status: Zenyx V3 is under active pretraining on a diverse corpus of web text, mathematics, and programming code. The final pretrained base model and supervised fine-tuned (SFT) chat models will be released soon. Stay tuned!
Model Architecture
Zenyx V3 incorporates several cutting-edge architectural enhancements:
- Sparse Mixture-of-Experts (MoE): 12 routed experts + 1 shared expert. Each token is routed to exactly 2 active experts by a differentiable Sinkhorn transport-based gating system.
- Multi-head Latent Attention (MLA): Compresses the Key-Value (KV) cache into a low-rank latent subspace, shrinking the cache's memory footprint and easing high-bandwidth memory (HBM) bottlenecks.
- Hyper-Connections: Applies Sinkhorn normalization iterations across residual connections to stabilize gradients at scale.
- Progressive Context Extension: Precomputed YaRN and RoPE scaling supporting context lengths of up to 8,192 tokens.
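The routing described in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Zenyx implementation: it uses plain Python rather than JAX for readability, and the function names `sinkhorn` and `route_top2`, the iteration count, and the random example logits are all hypothetical. Only the 12 routed experts and the 2 active experts per token come from the description above; the shared expert is not routed and is omitted. Sinkhorn normalization alternately rescales the columns (experts) and rows (tokens) of the exponentiated router logits, so each token distributes a total weight of 1 while experts are pushed towards an equal share of the load.

```python
import math
import random

NUM_ROUTED_EXPERTS = 12  # from the model description
TOP_K = 2                # active routed experts per token


def sinkhorn(logits, num_iters=10):
    """Alternately normalize columns and rows of exp(logits).

    After the final row step each token's weights sum to 1; the column
    steps push every expert towards an equal share of the total mass.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Subtract the per-row max before exponentiating for numerical stability.
    plan = [[math.exp(x - max(row)) for x in row] for row in logits]
    target_col_mass = num_tokens / num_experts
    for _ in range(num_iters):
        # Column step: each expert receives target_col_mass in total.
        col_sums = [
            sum(plan[t][e] for t in range(num_tokens))
            for e in range(num_experts)
        ]
        plan = [
            [row[e] * target_col_mass / col_sums[e] for e in range(num_experts)]
            for row in plan
        ]
        # Row step: each token distributes a total mass of 1.
        plan = [[v / sum(row) for v in row] for row in plan]
    return plan


def route_top2(logits):
    """Return, per token, a list of (expert_index, gate_weight) pairs."""
    plan = sinkhorn(logits)
    routes = []
    for row in plan:
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:TOP_K]
        total = sum(row[e] for e in top)
        # Renormalize so the two selected gate weights sum to 1.
        routes.append([(e, row[e] / total) for e in top])
    return routes


random.seed(0)
example_logits = [
    [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
    for _ in range(8)  # 8 toy tokens
]
routes = route_top2(example_logits)
for token_index, picks in enumerate(routes):
    print(token_index, [(e, round(w, 3)) for e, w in picks])
```

Every step before the top-2 selection is a smooth function of the logits, which is why a gate built this way can be trained by gradient descent when written in an autodiff framework such as JAX.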
Intermediate Checkpoint Benchmarks (Step 32,000)
Current training progress: 17.2 billion tokens.
1. Capability Benchmarks
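Evaluated using log-probability choice prediction: each answer choice is scored by the log-probability the model assigns to it, and the highest-scoring choice is taken as the prediction.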
| Benchmark Task | Step 32k Accuracy | Random Baseline | vs. Baseline (percentage points) | Description |
|---|---|---|---|---|
| MMLU - Elementary Math | 32.00% | 25.00% | +7.00 | Elementary mathematics multiple-choice reasoning |
| MMLU - College CS | 28.00% | 25.00% | +3.00 | College-level computer science knowledge |
| OpenBookQA | 26.00% | 25.00% | +1.00 | Elementary science QA |
| ARC - Easy | 24.00% | 25.00% | -1.00 | Common grade-school science QA |
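Log-probability choice prediction is commonly implemented along the lines below. This is a hedged sketch rather than the evaluation harness used for the table: `score_fn` stands in for a model call that returns the log-probability of a choice given the question, and `toy_score_fn` is a placeholder with no relation to any real model. Whether the scores are length-normalized in the actual evaluation is not stated, so the sketch leaves them raw.

```python
from typing import Callable, Sequence


def predict_choice(
    question: str,
    choices: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the choice with the highest log-probability."""
    scores = [score_fn(question, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


def accuracy(examples, score_fn) -> float:
    """Fraction of examples where the top-scoring choice is the gold answer."""
    correct = sum(
        predict_choice(question, choices, score_fn) == gold
        for question, choices, gold in examples
    )
    return correct / len(examples)


def toy_score_fn(question: str, choice: str) -> float:
    """Placeholder scorer (NOT a real model): shorter answers score higher."""
    return -float(len(choice))


# Each example: (question, answer choices, index of the gold answer).
examples = [
    ("2 + 2 = ?", ["4", "five", "twenty-two", "zero"], 0),
    ("Which is a planet?", ["a rock", "Mars", "the Moon", "a comet"], 1),
]
acc = accuracy(examples, toy_score_fn)
random_baseline = 1 / 4  # four answer options per question
delta_pp = (acc - random_baseline) * 100
print(f"accuracy={acc:.2%}  vs. baseline={delta_pp:+.2f} pp")
```

With four options per question, uniform guessing yields the 25.00% random baseline shown in the table, and the last column is the accuracy minus that baseline.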
2. Hardware Serving Benchmarks (NVIDIA L4 GPU)
Measured using an optimized JAX/Flax serving loop with statically shaped, pre-allocated buffers and on-GPU sampling.
| Metric | Value | Notes |
|---|---|---|
| Autoregressive Decode Speed | 75.32 tokens/sec | Sustained at the L4 memory-bandwidth ceiling |
| Time-To-First-Token (TTFT) | 17.0 ms | Latency to the first generated token |
| Warm Prefill Time | 3.9 ms | Enabled by bucketed padding |
| Active VRAM Usage | 5.02 GB | Out of 24 GB L4 memory |
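The bucketed padding mentioned for warm prefill can be illustrated as follows. JIT-compiled JAX functions are recompiled whenever an input shape changes, so rounding each prompt up to one of a few fixed lengths lets a serving loop reuse already compiled executables. The sketch below is an assumption-laden illustration in plain Python: the bucket sizes, the padding id `0`, and the names `pick_bucket` and `pad_to_bucket` are hypothetical, and only the 8,192-token upper bound comes from the model description.

```python
import bisect

# Hypothetical bucket sizes; the largest matches the stated 8,192-token context.
BUCKETS = [64, 128, 256, 512, 1024, 2048, 4096, 8192]
PAD_ID = 0  # assumed padding token id


def pick_bucket(length: int) -> int:
    """Smallest bucket that can hold `length` tokens."""
    if length > BUCKETS[-1]:
        raise ValueError(f"prompt of {length} tokens exceeds {BUCKETS[-1]}")
    return BUCKETS[bisect.bisect_left(BUCKETS, length)]


def pad_to_bucket(token_ids):
    """Right-pad token ids to the bucket length and return a validity mask."""
    bucket = pick_bucket(len(token_ids))
    num_pad = bucket - len(token_ids)
    padded = list(token_ids) + [PAD_ID] * num_pad
    mask = [1] * len(token_ids) + [0] * num_pad  # 1 = real token, 0 = padding
    return padded, mask


padded, mask = pad_to_bucket(list(range(1, 71)))  # a 70-token prompt
print(len(padded), sum(mask))  # prints: 128 70
```

Fewer buckets mean fewer compilations but more wasted compute on padding; the mask lets attention ignore the padded positions.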
Inference Example
Here is a simple example of how to load the model parameters and generate text using JAX:
```python
from zenyx_v3_inference import ZenyxGenerator

# Initialize the generator (downloads checkpoint parameters from the HF Hub automatically)
generator = ZenyxGenerator(step=32000)

prompt = "Explain the concept of neural networks in simple terms:"
response = generator.generate(
    prompt,
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.25,
)
print(response)
```