Cross-layer transcoder for Qwen3-0.6B-Base

Blog | Technical Report | Feature Dashboard | BluelightAI

This is a cross-layer transcoder trained to interpret the activations of Qwen3-0.6B-Base. It can be used with the open source circuit-tracer library to build and interrogate attribution graphs for prompts. You can also explore the features on our dashboard.

What is a cross-layer transcoder?

A cross-layer transcoder is an interpreter model trained to extract sparsely-activating features from the activations of a Transformer model. Its encoder translates the input $x^{\text{in}}_{\ell}$ of each MLP layer of the Transformer into a high-dimensional but sparse feature vector $f_\ell$. The decoder then reconstructs the output $x^{\text{out}}_{\ell}$ of the MLP using the features extracted at that layer and all earlier layers. In formulas:

$$f_\ell = \sigma(W^{\text{enc}}_\ell x^{\text{in}}_{\ell} + b^{\text{enc}}_\ell), \qquad \hat{x}^{\text{out}}_\ell = \sum_{k \leq \ell} W^{\text{dec}}_{k \to \ell} f_k + b^{\text{dec}}_\ell$$

where $\sigma$ is a sparsity-encouraging activation function.
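
As a concrete illustration, here is a minimal sketch of the encode/decode computation in PyTorch. This is not the training code; the per-layer weight containers and their indexing are illustrative assumptions, and `sigma` stands in for the JumpReLU activation described under Model Details.

import torch

def clt_forward(x_in, W_enc, b_enc, W_dec, b_dec, sigma):
    # x_in: list of per-layer MLP inputs, each of shape (d_model,)
    # W_enc[l]: (d_latent, d_model), b_enc[l]: (d_latent,)
    # W_dec[k][l]: (d_latent, d_model), decoding layer k's features into layer l's output
    # b_dec[l]: (d_model,)
    n_layers = len(x_in)
    # Encoder: per-layer sparse feature vectors f_l
    f = [sigma(W_enc[l] @ x_in[l] + b_enc[l]) for l in range(n_layers)]
    # Decoder: reconstruct each layer's MLP output from features of all layers k <= l
    x_out_hat = []
    for l in range(n_layers):
        recon = b_dec[l].clone()
        for k in range(l + 1):
            recon = recon + f[k] @ W_dec[k][l]  # (d_latent,) @ (d_latent, d_model) -> (d_model,)
        x_out_hat.append(recon)
    return f, x_out_hat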

The model is trained with a reconstruction loss alongside an auxiliary loss to encourage sparsity, roughly:

$$\mathcal{L}(x^{\text{in}}, x^{\text{out}}) = \|\hat{x}^{\text{out}} - x^{\text{out}}\|_2^2 + \lambda \|\tanh(\alpha f)\|_1$$
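
In code, the objective might look roughly like the sketch below, summed over layers; $\lambda$ and $\alpha$ are hyperparameters not specified in this card.

import torch

def clt_loss(x_out_hat, x_out, f, lam, alpha):
    # Reconstruction term: squared error between reconstructed and true MLP outputs
    recon = sum(((xh - x) ** 2).sum() for xh, x in zip(x_out_hat, x_out))
    # Sparsity term: tanh-saturated L1 penalty on feature activations
    sparsity = sum(torch.tanh(alpha * fl).abs().sum() for fl in f)
    return recon + lam * sparsity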

See our technical report for more details.

Model Details

This model is a cross-layer transcoder with 20480 features per layer (an expansion factor of 20x), using a JumpReLU activation function. It attains an L0 sparsity across layers of approximately 115, with about 23% of variance in MLP outputs unexplained.
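For reference, JumpReLU zeroes out any pre-activation below a learned per-feature threshold (the threshold_{layer} tensors described under Weight format). A minimal sketch:

import torch

def jumprelu(pre_acts, threshold):
    # pre_acts, threshold: (d_latent,); a feature only fires if it exceeds its learned threshold
    return torch.where(pre_acts > threshold, pre_acts, torch.zeros_like(pre_acts))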

The model was trained on approximately 750 million tokens of text from a broad range of domains, including general web text, public domain books, scientific articles, and code.

Usage

This CLT can be used with the circuit-tracer library to generate attribution graphs.

Note: To load the CLT in circuit-tracer with the Qwen3-0.6B-Base model, you will need to install a patched version of TransformerLens, as the current release does not support the Qwen3 base models; such a patched version is available here. The CLT can be used with an unpatched TransformerLens if you use the Qwen/Qwen3-0.6B model instead.

You can load the CLT in circuit-tracer as follows:

import torch
from circuit_tracer import ReplacementModel
model_name = "Qwen/Qwen3-0.6B-Base" # Or just "Qwen/Qwen3-0.6B"
transcoder_name = "bluelightai/clt-qwen3-0.6b-base-20k"
clt = ReplacementModel.from_pretrained(model_name, transcoder_name, dtype=torch.bfloat16)

See this Colab notebook for a complete example.
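
Once the ReplacementModel is loaded, it can be used to build an attribution graph for a prompt. A hedged sketch based on circuit-tracer's attribute entry point; argument names and defaults may differ across library versions, so check the Colab above for the exact interface.

from circuit_tracer import attribute

prompt = "The capital of France is"  # example prompt, not from this card
# Runs the prompt through the replacement model and builds an attribution graph
graph = attribute(prompt, clt)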

Weight format

The model weights are sharded by layer across multiple files. The W_enc_{layer}.safetensors file contains the following named tensors:

  • W_enc_{layer}: encoder weights, shape (d_latent, d_model)
  • b_enc_{layer}: encoder biases, shape (d_latent,)
  • threshold_{layer}: JumpReLU thresholds, shape (d_latent,)
  • b_dec_{layer}: decoder biases, shape (d_model,)

The W_dec_{layer}.safetensors file has a single tensor named W_dec_{layer}, containing the weights decoding that layer's features to the outputs of that layer and all subsequent layers, with shape (d_latent, n_out_layers, d_model).
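
To inspect the raw weights outside circuit-tracer, the shards can be downloaded and read with the safetensors library. A sketch using layer 0 as an illustrative example; the repo id and file names follow the description above.

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "bluelightai/clt-qwen3-0.6b-base-20k"
layer = 0  # any layer index of the base model

enc_path = hf_hub_download(repo_id, f"W_enc_{layer}.safetensors")
dec_path = hf_hub_download(repo_id, f"W_dec_{layer}.safetensors")

enc = load_file(enc_path)
dec = load_file(dec_path)

W_enc = enc[f"W_enc_{layer}"]          # (d_latent, d_model)
b_enc = enc[f"b_enc_{layer}"]          # (d_latent,)
threshold = enc[f"threshold_{layer}"]  # (d_latent,)
b_dec = enc[f"b_dec_{layer}"]          # (d_model,)
W_dec = dec[f"W_dec_{layer}"]          # (d_latent, n_out_layers, d_model)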

Contact

You can reach us at [email protected] with any questions or inspiration.
