Gemma 4 E2B Medical QLoRA Adapter
This is a QLoRA adapter for google/gemma-4-E2B-it, fine-tuned for medical-domain question answering and clinical reasoning. It was trained on a curated mix of medical instruction-following datasets on a single consumer GPU.
Note: This repository contains only the LoRA adapter weights. To use it, you must load the base model and apply the adapter at runtime. See the usage example below.
Model Details
| Field | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| PEFT type | LoRA (QLoRA 4-bit) |
| Rank (r) | 16 |
| Alpha | 16 |
| Dropout | 0.05 |
| Target modules | q/k/v/o projections + gate/up/down projections (all layers) |
| Adapter size (trainable weights) | ~145 MB |
| Precision | BF16 (merged) / 4-bit NF4 (training) |
| Task type | Causal LM |
| Framework | PEFT 0.18.1 |
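
As a rough illustration of what the rank and alpha values in the table mean, the sketch below computes the standard LoRA scaling factor (alpha / r) and the number of trainable parameters LoRA adds to a single linear layer. The `LoraSettings` class, the `lora_param_count` helper, and the 2048-wide layer are hypothetical and chosen only for illustration; they are not taken from the training code or from the real Gemma layer shapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Values copied from the Model Details table."""
    r: int = 16
    alpha: int = 16
    dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
        return self.alpha / self.r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA adds A (r x d_in) and B (d_out x r), so r * (d_in + d_out) in total.
    """
    return r * (d_in + d_out)


settings = LoraSettings()
print("scaling:", settings.scaling)
# Hypothetical 2048 -> 2048 projection, NOT the real Gemma layer shape.
print("params per layer:", lora_param_count(2048, 2048, settings.r))
```

With alpha equal to r, the scaling factor is 1, so the low-rank update is applied to the frozen weights without extra amplification or damping.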
Training Data
The adapter was trained on a unified medical dataset comprising three sources:
| Source | Split | Samples |
|---|---|---|
| Shekswess / medical-question-answering-datasets (medqa_prefix) | train | ~16 000 |
| LFMao-medical / medical-o1-reasoning-SFT | train | ~14 000 |
| LFMao-medical / medical-o1-reasoning-SFT (AlpaCare filtered) | train | ~17 000 |
| Total train | train | 47 189 |
| Validation | validation | 2 483 |
All samples were converted to a unified chat template compatible with the Gemma 4 instruction format.
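
The conversion step might look roughly like the sketch below, which maps one question/answer pair to a list of role-tagged messages. The function name and the assumption that every source sample reduces to a single question and a single answer are hypothetical; the actual preprocessing script is not part of this card.

```python
def to_chat_messages(question: str, answer: str) -> list[dict[str, str]]:
    """Turn one QA pair into a two-turn chat in the common role/content layout.

    Assumption: each sample has exactly one question and one answer.
    """
    return [
        {"role": "user", "content": question.strip()},
        {"role": "assistant", "content": answer.strip()},
    ]


# Made-up sample, for illustration only.
sample = to_chat_messages(
    "  What does the abbreviation ECG stand for? ",
    "Electrocardiogram. ",
)
for message in sample:
    print(message["role"], "->", message["content"])
```

A list in this shape can be rendered to text with the tokenizer's `apply_chat_template` method in Transformers, which applies whatever template the base model's tokenizer ships with.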
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size | 2 (per device) |
| Gradient accumulation | 8 |
| Effective batch size | 16 |
| Max seq length | 2048 |
| Optimizer | AdamW (8-bit) |
| Epochs | 1 |
| Total steps | 2 959 |
| Precision | 4-bit NF4 (QLoRA) + BF16 compute |
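
To make the schedule concrete, the sketch below derives the effective batch size and evaluates a cosine learning-rate curve with linear warmup from the values in the table. It is an approximation under stated assumptions: the warmup step count is rounded up, the rate decays to zero at the final step, and the `lr_at` helper is hypothetical rather than the scheduler implementation used in training.

```python
import math

# Values copied from the hyperparameter table.
PEAK_LR = 2e-4
TOTAL_STEPS = 2959
WARMUP_RATIO = 0.05
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8

# Samples contributing to each optimizer step on a single GPU.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Assumption: the warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch)
print("warmup steps:", warmup_steps)
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```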
Hardware
| Item | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3060 12 GB |
| VRAM used | ~11.4 GB peak |
| Training time | ~15 hours |
| Platform | Ubuntu Linux, CUDA 12.x |
Evaluation Results
Quantitative Benchmarks
| Benchmark | Base (google/gemma-4-E2B-it) | + QLoRA Adapter | Delta |
|---|---|---|---|
| MedQA (4-option) | - | - | +5.2 pp |
| PubMedQA | - | - | +0.8 pp |
| Best eval loss | - | 1.291 | - |
| Accuracy | - | 69.7% | - |
A dash marks a value that is not reported here. Base-model scores come from internal runs; the Delta column is the measured improvement of the fine-tuned model over the base model on the same splits, in percentage points (pp).
Qualitative Evaluation
A structured clinical-prompt evaluation across 54 prompts covering 7 medical disciplines yielded:
| Metric | Value |
|---|---|
| Avg key-point hit rate | 38.3% |
| Top discipline | Internal Medicine (44.4%) |
| Lowest discipline | Dermatology (26.7%) |
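
This card does not spell out how the key-point hit rate is scored. One plausible reading, sketched below purely as an assumption, is the fraction of expected key points that appear in a response, averaged over prompts. The case-insensitive substring matching, the function names, and the example data are all hypothetical.

```python
def key_point_hit_rate(response: str, key_points: list[str]) -> float:
    """Fraction of expected key points mentioned in the response.

    Assumption: a key point counts as hit when it occurs as a
    case-insensitive substring. The real evaluation may use another rule.
    """
    if not key_points:
        raise ValueError("key_points must not be empty")
    text = response.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)


def average_hit_rate(rates: list[float]) -> float:
    """Unweighted mean of per-prompt hit rates."""
    if not rates:
        raise ValueError("rates must not be empty")
    return sum(rates) / len(rates)


# Made-up response and key points, for illustration only.
rate = key_point_hit_rate(
    "The answer mentions Nitrofurantoin but nothing else.",
    ["nitrofurantoin", "fosfomycin", "trimethoprim"],
)
print(rate)
print(average_hit_rate([rate, 1.0]))
```

Substring matching is strict about wording, so a metric of this kind tends to under-count answers that paraphrase a key point.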
Intended Use
Direct Use
- Medical question-answering in English
- Clinical reasoning assistance (not diagnosis)
- Medical education and study support
Downstream Use
- Further fine-tuning on specific medical specialties
- Integration into clinical NLP pipelines
- Retrieval-augmented generation (RAG) with medical corpora
Out-of-Scope Use
- NOT a diagnostic tool: outputs must be verified by medical professionals
- NOT suitable for direct patient care decisions
- NOT recommended for languages other than English (no multilingual training)
- NOT a substitute for clinical judgment
Limitations
- Hallucination risk: The model can generate plausible-sounding but incorrect medical information. Always verify outputs against reliable sources.
- Knowledge cutoff: Training data reflects knowledge up to the dataset creation date. Newer medical guidelines or drug approvals may not be covered.
- Bias: The training data skews toward English-language sources and may not represent global medical practices equitably.
- Domain concentration: Performance varies by specialty; some areas (e.g., dermatology, rare diseases) are less well-covered.
- Adapter dependency: This adapter can only be used with google/gemma-4-E2B-it as the base model.
How to Use
Requirements
pip install transformers peft accelerate bitsandbytes torch
Quick Inference (with adapter, 4-bit quantized)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
BASE_MODEL = "google/gemma-4-E2B-it"
ADAPTER_REPO = "fulvian/gemma-4-e2b-medical-qlora-adapter"
# Load base model in 4-bit for 12 GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
# Load adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
# Generate
prompt = """You are a medical AI assistant. Answer the following question accurately.
Question: What are the first-line treatments for acute uncomplicated cystitis in non-pregnant women?
Answer:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Merged Inference (BF16, requires ~10 GB VRAM)
If you have loaded and merged the adapter into the base model, you can push the merged weights separately and load them directly:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MERGED_REPO = "fulvio/gemma-4-e2b-medical-qlora-merged"  # if uploaded
tokenizer = AutoTokenizer.from_pretrained(MERGED_REPO)
model = AutoModelForCausalLM.from_pretrained(
    MERGED_REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
Hardware Requirements
| Mode | Min VRAM | Recommended GPU |
|---|---|---|
| Adapter inference (4-bit base + adapter) | ~6 GB | RTX 3060 12 GB |
| Adapter inference (BF16 base + adapter) | ~10 GB | RTX 3060 12 GB |
| Merged model inference (BF16) | ~10 GB | RTX 3060 12 GB |
| Training (QLoRA) | ~11.4 GB | RTX 3060 12 GB |
Environmental Impact
| Item | Estimate |
|---|---|
| Hardware | 1 × NVIDIA RTX 3060 12 GB |
| Training duration | ~15 hours |
| Power consumption | ~170 W (TDP) |
| Estimated CO₂ | ~2.5 kg CO₂eq (EU grid avg) |
Carbon emissions estimated using the ML Impact calculator (Lacoste et al., 2019).
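
The figures in the table are related by simple arithmetic: energy is power times time, and emissions are energy times the carbon intensity of the grid. The sketch below assumes the GPU drew its full ~170 W for the whole ~15 hours, which is an upper bound, and ignores the rest of the system. The card does not state the carbon intensity that was used, so the code only backs out the value implied by the table; the helper names are hypothetical.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kWh for a constant power draw over a duration."""
    return power_watts * hours / 1000.0


def co2_kg(energy: float, intensity_kg_per_kwh: float) -> float:
    """Emissions in kg CO2eq for a given energy and grid carbon intensity."""
    return energy * intensity_kg_per_kwh


# Values from the table: ~170 W for ~15 hours, ~2.5 kg CO2eq reported.
energy = energy_kwh(170, 15)
implied_intensity = 2.5 / energy  # kg CO2eq per kWh implied by the table

print(f"energy: {energy:.2f} kWh")
print(f"implied carbon intensity: {implied_intensity:.2f} kg CO2eq/kWh")
```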
Citation
If you use this adapter, please cite both the original Gemma model and this fine-tuning work:
@misc{gemma4e2b_medical_qlora,
author = {Fulvio},
title = {QLoRA Medical Adapter for Gemma 4 E2B},
year = {2025},
howpublished = {\url{https://huggingface.co/fulvian/gemma-4-e2b-medical-qlora-adapter}},
}
@article{gemma2024,
title = {Gemma: Open Models Based on Gemini Research and Technology},
author = {Gemma Team},
year = {2024},
howpublished = {\url{https://huggingface.co/google/gemma-4-E2B-it}},
}
Model Card Authors
Fulvio
Model Card Contact
For questions or issues, please open an issue on the Hugging Face repository.
Built with 🩺 and QLoRA on a single RTX 3060 12 GB.