How to use litert-community/Qwen3-4B with LiteRT-LM:

```bash
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
# model.litertlm is a placeholder; pass one of the .litertlm files listed under Available Artifacts.
litert-lm run \
  --from-huggingface-repo=litert-community/Qwen3-4B \
  model.litertlm \
  --prompt="Write me a poem"
```
Qwen3-4B LiteRT-LM Models
This repository contains LiteRT-LM variants of Qwen/Qwen3-4B optimized for on-device text generation.
Available Artifacts
| File | Quantization | Context (tokens) | Size |
|---|---|---|---|
| `qwen3_4b_channelwise_int8_float32kv.litertlm` | channel-wise INT8 weights, float32 KV | - | 5.28 GB |
| `qwen3_4b_mixed_int4.litertlm` | TorchAO mixed INT4, float KV | 2048 | 2535.88 MiB |
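
The two sizes above are given in different units. A minimal sketch for putting them on one scale, assuming "GB" means 10^9 bytes and "MiB" means 2^20 bytes (the table does not state which convention the 5.28 GB figure follows, so treat the comparison as approximate):

```python
# Assumption: GB = 10**9 bytes, MiB = 2**20 bytes.
BYTES_PER_UNIT = {"GB": 10**9, "MiB": 2**20}

def to_gb(value: float, unit: str) -> float:
    """Convert a size in the given unit to decimal gigabytes."""
    return value * BYTES_PER_UNIT[unit] / 10**9

# Sizes copied from the Available Artifacts table.
sizes = {
    "qwen3_4b_channelwise_int8_float32kv.litertlm": (5.28, "GB"),
    "qwen3_4b_mixed_int4.litertlm": (2535.88, "MiB"),
}

sizes_gb = {name: to_gb(value, unit) for name, (value, unit) in sizes.items()}
for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.2f} GB")
```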
Conversion Notes
The mixed INT4 .litertlm artifact was produced with a TorchAO-based quantize-first recipe from the original Hugging Face checkpoint. This is a mixed quantization layout rather than a uniform all-INT4 model: eligible linear projection weights are stored as blockwise INT4 with group size 32 and floating-point scales, token embedding weights use weight-only INT8 quantization, and normalization/reduction paths plus KV cache tensors remain floating point.
The mixed INT4 bundle also uses LiteRT-LM StableHLO composite ops for attention/cache execution, including odml.runtime_bmm and odml.cache_update.
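
To make the "blockwise INT4 with group size 32 and floating-point scales" layout concrete, here is a minimal pure-Python sketch of group-wise quantization. It assumes symmetric scaling with one scale per group and no zero point; the actual TorchAO recipe used for this artifact may differ in rounding, zero points, or packing, and the function names are hypothetical.

```python
GROUP_SIZE = 32      # group size stated in the conversion notes
QMIN, QMAX = -8, 7   # signed 4-bit integer range

def quantize_blockwise_int4(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of floats into (int4 codes, per-group scales).

    Assumption: symmetric quantization, one floating-point scale per group.
    """
    if len(weights) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Fall back to a scale of 1.0 for an all-zero group.
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(QMIN, min(QMAX, round(w / scale))) for w in group)
    return codes, scales

def dequantize_blockwise_int4(codes, scales, group_size=GROUP_SIZE):
    """Reconstruct approximate floats from codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]

# Two groups with very different magnitudes, each with its own scale.
weights = [0.01 * (i - 16) for i in range(32)] + [1.0 * (i - 16) for i in range(32)]
codes, scales = quantize_blockwise_int4(weights)
restored = dequantize_blockwise_int4(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(scales)}, max abs error={max_err:.4f}")
```

Because each group carries its own scale, the small-magnitude group keeps a much finer step size than it would under a single per-tensor scale, which is the usual motivation for small group sizes.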
Performance
Desktop benchmark: AMD Radeon AI PRO R9700, LiteRT-LM WebGPU, 256 prefill tokens, 32 decode tokens. Android rows use LiteRT-LM v0.13.1 with GPU OpenCL, 256 prefill tokens, and 64 decode tokens. Values report the warmed iteration from a two-iteration run unless noted.
Hardware benchmark disclosure: Results were measured by us on retail devices purchased through normal channels. These results are not affiliated with, sponsored by, endorsed by, or verified by Samsung, vivo, Qualcomm, MediaTek, Google, MLCommons, or Hugging Face. Results depend on device SKU, OS build, thermal state, battery mode, backend, model quantization, runtime version, and benchmark settings.
| Device / Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---|---|---|
| Desktop GPU WebGPU | 1327.33 | 87.52 | 0.20 | 1697 MB |
| Samsung SM-S937U1 GPU OpenCL | 357.57 | 19.14 | 0.77 | 1609 MB |
| vivo V2502A GPU OpenCL | 170.66 | 12.67 | 1.58 | 4722 MB |
| TECNO LJ9 GPU OpenCL | 104.37 | 10.86 | 2.54 | 4906 MB |
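
The table can be turned into a rough response-time estimate. The sketch below assumes end-to-end latency is approximately TTFT plus the number of generated tokens divided by the decode rate, and it uses a hypothetical 128-token response. The decode rates were measured over 32 or 64 tokens, so extrapolating to longer outputs is an assumption, not a measured result.

```python
# Rows copied from the performance table: (prefill tok/s, decode tok/s, TTFT s).
BENCHMARKS = {
    "Desktop GPU WebGPU": (1327.33, 87.52, 0.20),
    "Samsung SM-S937U1 GPU OpenCL": (357.57, 19.14, 0.77),
    "vivo V2502A GPU OpenCL": (170.66, 12.67, 1.58),
    "TECNO LJ9 GPU OpenCL": (104.37, 10.86, 2.54),
}
PREFILL_TOKENS = 256  # prompt length used in the benchmark

def estimate_latency(ttft_s, decode_tps, new_tokens):
    """Rough end-to-end time: TTFT plus steady-state decode time."""
    return ttft_s + new_tokens / decode_tps

for device, (prefill_tps, decode_tps, ttft_s) in BENCHMARKS.items():
    # Pure prefill compute time; the reported TTFT is slightly higher in each row.
    prefill_only = PREFILL_TOKENS / prefill_tps
    total = estimate_latency(ttft_s, decode_tps, new_tokens=128)
    print(f"{device}: prefill ~{prefill_only:.2f}s, 128 tokens ~{total:.1f}s")
```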
Try It
Install uv and run:
```bash
uv tool install litert-lm
uvx litert-lm run --from-huggingface-repo=litert-community/Qwen3-4B qwen3_4b_mixed_int4.litertlm --prompt="What is the capital of France?"
```
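
The same command can be driven from a script. This is a minimal sketch, assuming `uvx` is on the PATH and that the CLI writes the model response to standard output (the output format is not documented here). `build_command` and `run_prompt` are hypothetical helper names; only the flags shown above are used.

```python
import shlex
import subprocess

REPO = "litert-community/Qwen3-4B"
MODEL_FILE = "qwen3_4b_mixed_int4.litertlm"

def build_command(prompt, repo=REPO, model_file=MODEL_FILE):
    """Build the argv list for the `uvx litert-lm run` call shown above."""
    return [
        "uvx", "litert-lm", "run",
        f"--from-huggingface-repo={repo}",
        model_file,
        f"--prompt={prompt}",
    ]

def run_prompt(prompt):
    """Run the CLI and return its stdout (requires uv and network access)."""
    result = subprocess.run(
        build_command(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    cmd = build_command("What is the capital of France?")
    print(shlex.join(cmd))  # show the command that would run
    # Uncomment to actually download and run the model:
    # print(run_prompt("What is the capital of France?"))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.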
Integration
Ready to integrate this into your product? Get started with the [LiteRT-LM documentation](https://ai.google.dev/edge/litert-lm).
Citation
```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}
```