Zenyx V3 Base (1.5B Mixture-of-Experts)

Zenyx V3 is a highly efficient 1.5B parameter Mixture-of-Experts (MoE) foundation model designed for low-latency inference, high throughput, and context length extensibility.

Current Training Status: Zenyx V3 is under active pretraining on a diverse corpus of web text, mathematics, and programming code. The final pretrained base model and the supervised fine-tuned (SFT) chat models will be released soon. Stay tuned!

Model Architecture

Zenyx V3 combines the following architectural components:

  • Sparse Mixture-of-Experts (MoE): 12 routed experts + 1 shared expert. Exactly 2 experts are active per token, selected by a differentiable Sinkhorn transport-based gating system.
  • Multi-head Latent Attention (MLA): Compresses the Key-Value (KV) cache into a low-rank latent subspace, reducing both the memory footprint of the cache and the memory-bandwidth bottleneck during decoding.
  • Hyper-Connections: Optimized using Sinkhorn normalization iterations across residual connections to stabilize gradients at scale.
  • Progressive Context Extension: Precomputed YaRN and RoPE scaling support context lengths of up to 8,192 tokens.
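
The model card does not spell out the gating formula, so the following is only a generic sketch of Sinkhorn-balanced top-2 routing, not the Zenyx implementation. It alternately normalizes rows (tokens) and columns (experts) of a score matrix so that load is spread evenly across the 12 routed experts, then keeps the 2 highest-weight experts per token. The function names, the toy logits, the batch of 24 tokens, and the iteration count are all assumptions made for illustration.

```python
import math
import random

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sinkhorn(logits, n_iters=20):
    """Alternate row and column normalization of exp(logits), in log space."""
    log_p = [row[:] for row in logits]
    n_tokens, n_experts = len(log_p), len(log_p[0])
    target = math.log(n_tokens / n_experts)  # equal mass per expert
    for _ in range(n_iters):
        # Row step: each token's weights over experts sum to 1.
        for i in range(n_tokens):
            lse = logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        # Column step: each expert receives n_tokens / n_experts total mass.
        for j in range(n_experts):
            lse = logsumexp([log_p[i][j] for i in range(n_tokens)])
            for i in range(n_tokens):
                log_p[i][j] += target - lse
    return [[math.exp(v) for v in row] for row in log_p]

def top_k_route(plan, k=2):
    """Keep the k largest entries per token and renormalize their weights."""
    routes = []
    for row in plan:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        total = sum(row[j] for j in top)
        routes.append([(j, row[j] / total) for j in top])
    return routes

# Toy router logits: 24 tokens x 12 routed experts (values are made up).
rng = random.Random(0)
logits = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(24)]

plan = sinkhorn(logits)
routes = top_k_route(plan, k=2)
print(routes[0])  # two (expert_index, weight) pairs for the first token
```

In this sketch the shared expert is not routed at all; it would simply be applied to every token alongside the two selected experts, which is the usual convention but is again an assumption here.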

Intermediate Checkpoint Benchmarks (Step 32,000)

Current training progress: 17.2 billion tokens.
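
As a quick sanity check on scale, the token count and step number imply an average number of tokens consumed per optimizer step. This assumes the 17.2 billion figure is the cumulative total at step 32,000 and that the batch size was constant, neither of which the card states explicitly.

```python
tokens_seen = 17.2e9   # cumulative training tokens reported at the checkpoint
steps = 32_000         # checkpoint step

# Average tokens per step, assuming a constant batch size throughout.
tokens_per_step = tokens_seen / steps
print(f"{tokens_per_step:,.0f} tokens per step")  # 537,500 tokens per step
```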

1. Capability Benchmarks

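Evaluated using log-probability choice prediction: for each question, the answer choice to which the model assigns the highest log-probability is taken as its prediction.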
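
A minimal sketch of this kind of scoring is shown below. It assumes the per-token log-probabilities of each answer choice have already been obtained from the model; the `pick_choice` helper, the optional length normalization, and the toy numbers are hypothetical and are not taken from the Zenyx evaluation code.

```python
def pick_choice(choice_logprobs, normalize_by_length=False):
    """Return the index of the answer choice with the highest score.

    choice_logprobs holds one list of per-token log-probabilities per choice.
    """
    scores = []
    for token_lps in choice_logprobs:
        total = sum(token_lps)
        # Optionally divide by token count so long answers are not penalized.
        scores.append(total / len(token_lps) if normalize_by_length else total)
    return max(range(len(scores)), key=lambda i: scores[i])

# Made-up log-probabilities for a four-way multiple-choice question.
example = [
    [-2.1, -0.9],        # choice A, total -3.0
    [-0.4, -0.7],        # choice B, total -1.1
    [-1.5, -1.2, -0.3],  # choice C, total -3.0
    [-3.0],              # choice D, total -3.0
]

predicted = pick_choice(example)
random_baseline = 1 / len(example)  # 0.25 for four choices

print(predicted)        # 1 (choice B)
print(random_baseline)  # 0.25
```

With four answer choices, uniform guessing yields the 25% random baseline that the benchmark accuracies are compared against.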

| Benchmark Task | Step 32k Accuracy | Random Baseline | vs. Baseline | Description |
| --- | --- | --- | --- | --- |
| MMLU - Elementary Math | 32.00% | 25.00% | +7.00% | Elementary mathematics multiple-choice reasoning |
| MMLU - College CS | 28.00% | 25.00% | +3.00% | College-level computer science knowledge |
| OpenBookQA | 26.00% | 25.00% | +1.00% | Elementary science QA |
| ARC - Easy | 24.00% | 25.00% | -1.00% | Common grade-school science QA |

2. Hardware Serving Benchmarks (NVIDIA L4 GPU)

Measured using an optimized JAX/Flax serving loop with static-shape pre-allocation and GPU-native sampling.

| Metric | Value | Notes |
| --- | --- | --- |
| Autoregressive Decode Speed | 75.32 tokens/sec | Bounded by the L4 memory-bandwidth ceiling |
| Time-To-First-Token (TTFT) | 17.0 ms | Latency until the first generated token |
| Warm Prefill Time | 3.9 ms | Enabled by bucketed padding |
| Active VRAM Usage | 5.02 GB | Out of 24 GB of L4 memory |
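
The figures above can be turned into a rough latency estimate with a few lines of arithmetic. This is a back-of-the-envelope sketch: the latency model (first token at TTFT, every later token at the steady decode rate) is an assumption, and so is the roughly 300 GB/s peak memory bandwidth used for the L4, which does not appear in the card.

```python
decode_tps = 75.32      # tokens/sec, from the table
ttft_ms = 17.0          # time to first token, from the table
vram_used_gb = 5.02     # active VRAM, from the table
vram_total_gb = 24.0    # L4 memory, from the table

per_token_ms = 1000.0 / decode_tps            # about 13.28 ms per decoded token
vram_fraction = vram_used_gb / vram_total_gb  # about 0.209

def estimated_latency_s(new_tokens):
    """First token arrives at TTFT; the rest follow at the decode rate."""
    return ttft_ms / 1000.0 + (new_tokens - 1) / decode_tps

# ASSUMPTION: peak L4 memory bandwidth of roughly 300 GB/s (not in the card).
assumed_bandwidth_gbps = 300.0
# Upper bound on data read per token if decoding ran exactly at that peak.
max_gb_per_token = assumed_bandwidth_gbps / decode_tps  # about 3.98 GB

print(f"{per_token_ms:.2f} ms/token, {vram_fraction:.1%} of VRAM")
print(f"100 tokens: about {estimated_latency_s(100):.2f} s")
print(f"at most {max_gb_per_token:.2f} GB read per token")
```

Under these assumptions, a 100-token generation takes roughly 1.33 seconds end to end.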

Inference Example

Here is a simple example of how to load the model parameters and generate text using JAX:

from zenyx_v3_inference import ZenyxGenerator

# Initialize the generator (downloads checkpoint parameters from the HF Hub automatically).
generator = ZenyxGenerator(step=32000)

prompt = "Explain the concept of neural networks in simple terms:"

# Sample up to 100 new tokens with temperature and repetition-penalty controls.
response = generator.generate(
    prompt,
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.25,
)
print(response)
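
For readers unfamiliar with the sampling parameters used above, the sketch below shows how temperature and a repetition penalty are commonly applied to next-token logits before sampling. It follows the widely used convention of dividing positive logits of already-generated tokens by the penalty and multiplying negative ones by it. Whether `ZenyxGenerator` does exactly this is not stated in the card, and the helper names and toy logits are hypothetical.

```python
import math
import random

def adjust_logits(logits, generated_ids, temperature=0.7, repetition_penalty=1.25):
    """Penalize tokens that were already generated, then apply temperature."""
    out = list(logits)
    for tok in set(generated_ids):
        # Lower the score of a repeated token regardless of its sign.
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    return [v / temperature for v in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, generated_ids, rng, **kwargs):
    probs = softmax(adjust_logits(logits, generated_ids, **kwargs))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy vocabulary of 4 tokens; tokens 0 and 2 were already generated.
toy_logits = [2.0, 1.0, -1.0, 0.5]
history = [0, 2]

adjusted = adjust_logits(toy_logits, history)
probs = softmax(adjusted)
next_token = sample_next(toy_logits, history, random.Random(0))
print(adjusted)
print(next_token)
```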