Zenyx V3 Base (1.5B Mixture-of-Experts)

Zenyx V3 is a 1.5B-parameter Mixture-of-Experts (MoE) foundation model designed for low-latency inference, high throughput, and extensible context length.

Current Training Status: Zenyx V3 is currently under active pretraining on a diverse corpus of web text, mathematics, and programming code. The final pretrained base model and supervised fine-tuned (SFT) chat models will be released soon. Stay tuned!

Model Architecture

Zenyx V3 incorporates several cutting-edge architectural enhancements:

Sparse Mixture-of-Experts (MoE): 12 routed experts + 1 shared expert. Exactly 2 routed experts are active per token, selected by a differentiable Sinkhorn transport-based gating system.
Multi-head Latent Attention (MLA): Compresses the Key-Value (KV) cache into a low-rank latent subspace, easing high-bandwidth memory (HBM) bottlenecks and shrinking the memory footprint.
Hyper-Connections: Applies Sinkhorn normalization iterations across the residual connections to stabilize gradients at scale.
Progressive Context Extension: Precomputed YaRN and RoPE scaling support context lengths of up to 8,192 tokens.
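To make the routing bullet concrete, here is a minimal sketch of Sinkhorn-balanced top-2 routing in plain Python. It assumes the router emits a token-by-expert score matrix, alternates column and row normalization so that load spreads roughly evenly over the 12 routed experts, and then keeps the 2 highest-weighted experts per token. The function names, the iteration count, and the omission of the shared expert are illustrative assumptions, not the Zenyx V3 implementation.

```python
import math
import random

NUM_ROUTED_EXPERTS = 12  # from the model card
TOP_K = 2                # active routed experts per token (from the model card)

def sinkhorn(scores, n_iters=5):
    """Alternately normalize columns and rows of exp(scores).

    scores: list of rows (one per token), each a list of per-expert logits.
    Returns a matrix whose rows sum to 1 and whose columns sum
    approximately to n_tokens / n_experts (balanced expert load).
    n_iters=5 is an arbitrary illustrative choice.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    # Subtract the global max before exponentiating for numerical stability.
    m = max(max(row) for row in scores)
    p = [[math.exp(s - m) for s in row] for row in scores]
    col_target = n_tokens / n_experts
    for _ in range(n_iters):
        # Column step: every expert receives an equal share of the total mass.
        col_sums = [sum(p[t][e] for t in range(n_tokens)) for e in range(n_experts)]
        p = [[p[t][e] * col_target / col_sums[e] for e in range(n_experts)]
             for t in range(n_tokens)]
        # Row step: every token distributes exactly one unit of mass.
        normalized = []
        for row in p:
            row_sum = sum(row)
            normalized.append([v / row_sum for v in row])
        p = normalized
    return p

def route_top_k(scores, k=TOP_K):
    """Per token, return k (expert_index, weight) pairs with weights summing to 1."""
    plan = sinkhorn(scores)
    routes = []
    for row in plan:
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        total = sum(row[e] for e in top)
        routes.append([(e, row[e] / total) for e in top])
    return routes

# Toy usage: 8 tokens with random router scores.
random.seed(0)
scores = [[random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)] for _ in range(8)]
routes = route_top_k(scores)
```

In a real model the same steps would run on accelerator arrays inside the training graph; nested lists are used here only to keep the sketch dependency-free.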

Intermediate Checkpoint Benchmarks (Step 32,000)

Current training progress: 17.2 billion tokens.
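As a quick sanity check, the two figures above imply an average number of tokens consumed per training step. The arithmetic below assumes the token count is rounded and that every step processed the same number of tokens, neither of which the card confirms.

```python
tokens_seen = 17.2e9  # from the model card (rounded)
steps = 32_000        # from the model card

tokens_per_step = tokens_seen / steps
print(f"{tokens_per_step:,.0f} tokens per step")  # prints: 537,500 tokens per step
```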

1. Capability Benchmarks

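Evaluated using log-probability choice prediction: the answer option to which the model assigns the highest log-probability is taken as its prediction.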
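Log-probability choice prediction can be sketched as follows. The sketch assumes a hypothetical `logprob_fn(context, continuation)` callable that returns per-token log-probabilities. The actual evaluation harness, prompt format, and any length normalization used for the numbers below are not specified in this card, so `length_normalize` is shown only as an optional variant.

```python
import math

def pick_choice(question, choices, logprob_fn, length_normalize=False):
    """Return the index of the answer option with the highest log-probability.

    logprob_fn(context, continuation) is a hypothetical callable returning a
    list of per-token log-probabilities of `continuation` given `context`.
    """
    best_idx, best_score = -1, -math.inf
    for idx, choice in enumerate(choices):
        token_logprobs = logprob_fn(question, choice)
        score = sum(token_logprobs)
        if length_normalize and token_logprobs:
            score /= len(token_logprobs)  # average instead of summed log-prob
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

def accuracy(examples, logprob_fn):
    """examples: iterable of (question, choices, gold_index) tuples."""
    examples = list(examples)
    correct = sum(pick_choice(q, c, logprob_fn) == gold for q, c, gold in examples)
    return correct / len(examples)

# Toy stand-in for a model: words already present in the context are "likely".
def toy_logprob_fn(context, continuation):
    context_words = set(context.lower().split())
    return [math.log(0.5) if w.lower() in context_words else math.log(0.1)
            for w in continuation.split()]

examples = [
    ("the answer is apple", ["apple", "pear", "plum", "fig"], 0),
    ("the answer is fig", ["apple", "pear", "plum", "fig"], 3),
]
toy_accuracy = accuracy(examples, toy_logprob_fn)
```

With four options per question, a model that scores all options equally lands near the 25% random baseline listed in the table.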

Benchmark Task	Step 32k Accuracy	Random Baseline	vs. Baseline (percentage points)	Description
MMLU - Elementary Math	32.00%	25.00%	+7.00	Elementary mathematics multiple-choice reasoning
MMLU - College CS	28.00%	25.00%	+3.00	College-level computer science knowledge
OpenBookQA	26.00%	25.00%	+1.00	Elementary science QA
ARC - Easy	24.00%	25.00%	-1.00	Common grade-school science QA

2. Hardware Serving Benchmarks (NVIDIA L4 GPU)

Measured using a JAX/Flax-optimized serving loop with static-shape pre-allocations and GPU-native sampling.

Metric	Value	Notes
Autoregressive Decode Speed	75.32 tokens/sec	Bound by the L4 memory-bandwidth ceiling
Time-To-First-Token (TTFT)	17.0 ms	Latency to the first generated token
Warm Prefill Time	3.9 ms	Enabled by bucketed padding
Active VRAM Usage	5.02 GB	Out of 24 GB L4 memory
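The "bucketed padding" note in the table refers to a common way of keeping input shapes static: `jax.jit` compiles a function once per distinct input shape, so padding every prompt up to one of a few fixed lengths lets warm requests reuse an already compiled prefill. The sketch below illustrates the idea in plain Python. The power-of-two bucket sizes, the minimum bucket of 16, and the pad ID of 0 are assumptions; only the 8,192-token ceiling comes from this card.

```python
def next_bucket(length, min_bucket=16, max_bucket=8192):
    """Smallest power-of-two bucket >= length, starting from min_bucket."""
    if length > max_bucket:
        raise ValueError(f"prompt of {length} tokens exceeds max bucket {max_bucket}")
    bucket = min_bucket
    while bucket < length:
        bucket *= 2
    return bucket

def pad_to_bucket(token_ids, pad_id=0):
    """Right-pad token_ids to its bucket length and return a validity mask."""
    bucket = next_bucket(len(token_ids))
    n_pad = bucket - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask

padded, mask = pad_to_bucket(list(range(1, 22)))  # 21 real tokens
print(len(padded), sum(mask))  # prints: 32 21
```

With these assumed buckets, prompts of 17 to 32 tokens all share one compiled shape, at the cost of some wasted compute on padding positions.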

Inference Example

Here is a simple example of how to load the model parameters and generate text using JAX:

from zenyx_v3_inference import ZenyxGenerator

# Initialize generator (downloads checkpoint parameters from HF hub automatically)
generator = ZenyxGenerator(step=32000)

prompt = "Explain the concept of neural networks in simple terms:"
response = generator.generate(
    prompt,
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.25
)
print(response)
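For readers unfamiliar with the `temperature` and `repetition_penalty` arguments, the sketch below shows how these two knobs are conventionally applied to next-token logits before sampling. It follows the widely used CTRL-style penalty and is an assumption about typical behavior, not a description of what `ZenyxGenerator` does internally. The helper names are hypothetical.

```python
import math
import random

def apply_repetition_penalty(logits, generated_ids, penalty=1.25):
    """Lower the logits of tokens that have already been generated."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def sample_token(logits, generated_ids, temperature=0.7, penalty=1.25, rng=random):
    """Sample one token ID after applying the penalty and temperature scaling."""
    penalized = apply_repetition_penalty(logits, generated_ids, penalty)
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    scaled = [value / temperature for value in penalized]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, -1.0, 0.5, 0.0]
token_id = sample_token(toy_logits, generated_ids=[0, 1], rng=rng)
```

A penalty of 1.0 leaves the logits unchanged, and lowering the temperature toward 0 approaches greedy decoding.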
