Gemma 4 E2B Medical QLoRA Adapter

This is a QLoRA adapter for google/gemma-4-E2B-it, fine-tuned for medical-domain question answering and clinical reasoning. It was trained on a curated mix of medical instruction-following datasets using a single consumer GPU.

Note: This repository contains only the LoRA adapter weights. To use it, you must load the base model and apply the adapter at runtime. See the usage example below.

Model Details

Field	Value
Base model	google/gemma-4-E2B-it
PEFT type	LoRA (QLoRA 4-bit)
Rank (r)	16
Alpha	16
Dropout	0.05
Target modules	q/k/v/o projections + gate/up/down projections (all layers)
Adapter size (trainable weights)	~145 MB
Precision	BF16 (merged) / 4-bit NF4 (training)
Task type	Causal LM
Framework	PEFT 0.18.1
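
To make the rank and alpha settings concrete, the sketch below computes the standard LoRA scaling factor (alpha divided by r) and the number of trainable parameters a rank-r adapter adds to one linear layer. The layer dimensions are hypothetical placeholders chosen to show the arithmetic; they are not the actual Gemma 4 E2B shapes, which this card does not list.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * d_in + d_out * r


# Values from the Model Details table.
R, ALPHA = 16, 16
print(lora_scaling(ALPHA, R))  # 1.0

# Hypothetical layer shape, used only to illustrate the formula.
print(lora_params_per_layer(d_in=2048, d_out=2048, r=R))  # 65536
```

With alpha equal to the rank, the adapter update is applied at a scale of 1.0, so changing the rank later would also change the effective scale unless alpha is adjusted with it.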

Training Data

The adapter was trained on a unified medical dataset comprising three sources:

Source	Split	Samples
Shekswess / medical-question-answering-datasets (medqa_prefix)	train	~16 000
LFMao-medical / medical-o1-reasoning-SFT	train	~14 000
LFMao-medical / medical-o1-reasoning-SFT (AlpaCare filtered)	train	~17 000
Total train		47 189
Validation		2 483

All samples were converted to a unified chat template compatible with the Gemma 4 instruction format.
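
As a rough illustration of that conversion, the sketch below maps one question-answer record to the role/content message list that chat templates consume. The field names `question` and `answer` and the helper `to_chat_messages` are hypothetical; the actual preprocessing script and the source dataset schemas are not part of this card.

```python
from typing import Dict, List


def to_chat_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert one QA record into a two-turn chat conversation.

    The keys "question" and "answer" are assumed for illustration only.
    """
    return [
        {"role": "user", "content": record["question"].strip()},
        {"role": "assistant", "content": record["answer"].strip()},
    ]


example = {
    "question": "Which vitamin deficiency causes scurvy? ",
    "answer": "Vitamin C (ascorbic acid) deficiency.",
}
messages = to_chat_messages(example)
print(messages)
```

A list in this shape can be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` to render the model-specific turn markers.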

Training Procedure

Hyperparameters

Parameter	Value
Learning rate	2e-4
LR scheduler	Cosine
Warmup ratio	0.05
Batch size	2 (per device)
Gradient accumulation	8
Effective batch size	16
Max seq length	2048
Optimizer	AdamW (8-bit)
Epochs	1
Total steps	2 959
Precision	4-bit NF4 (QLoRA) + BF16 compute
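
The derived quantities in this table follow from simple arithmetic, sketched below. The effective batch size is the per-device batch size times the gradient accumulation steps on one GPU, and `lr_at` is a generic linear-warmup-then-cosine-decay schedule. The exact scheduler implementation and the rounding of the warmup steps are assumptions; the card only states a cosine scheduler with a 0.05 warmup ratio.

```python
import math

LEARNING_RATE = 2e-4
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
TOTAL_STEPS = 2959   # as reported in the table
WARMUP_RATIO = 0.05

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM       # single GPU
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)  # rounding is assumed


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, warmup_steps)
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 2e-4 once warmup ends and decays to zero by the final step.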

Hardware

Item	Value
GPU	NVIDIA GeForce RTX 3060 12 GB
VRAM used	~11.4 GB peak
Training time	~15 hours
Platform	Ubuntu Linux, CUDA 12.x

Evaluation Results

Quantitative Benchmarks

Benchmark	Base (google/gemma-4-E2B-it)	+ QLoRA Adapter	Delta
MedQA (4-option)	-	-	+5.2 pp
PubMedQA	-	-	+0.8 pp
Best eval loss	-	1.291	-
Accuracy	-	69.7%	-

Dashes mark values that are not reported in this card. The base-model scores come from internal runs and are not listed; the Delta column gives the measured improvement of the fine-tuned model over the base model on the same splits, in percentage points (pp).

Qualitative Evaluation

A structured clinical-prompt evaluation across 54 prompts covering 7 medical disciplines yielded:

Metric	Value
Avg key-point hit rate	38.3%
Top discipline	Internal Medicine (44.4%)
Lowest discipline	Dermatology (26.7%)

Intended Use

Direct Use

Medical question-answering in English
Clinical reasoning assistance (not diagnosis)
Medical education and study support

Downstream Use

Further fine-tuning on specific medical specialties
Integration into clinical NLP pipelines
Retrieval-augmented generation (RAG) with medical corpora

Out-of-Scope Use

NOT a diagnostic tool: outputs must be verified by medical professionals
NOT suitable for direct patient care decisions
NOT recommended for languages other than English (no multilingual training)
NOT a substitute for clinical judgment

Limitations

Hallucination risk: The model can generate plausible-sounding but incorrect medical information. Always verify outputs against reliable sources.
Knowledge cutoff: Training data reflects knowledge up to the dataset creation date. Newer medical guidelines or drug approvals may not be covered.
Bias: The training data skews toward English-language sources and may not represent global medical practices equitably.
Domain concentration: Performance varies by specialty; some areas (e.g., dermatology, rare diseases) are less well-covered.
Adapter dependency: This adapter can only be used with google/gemma-4-E2B-it as the base model.

How to Use

Requirements

pip install transformers peft accelerate bitsandbytes torch

Quick Inference (with adapter, 4-bit quantized)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "google/gemma-4-E2B-it"
ADAPTER_REPO = "fulvian/gemma-4-e2b-medical-qlora-adapter"

# Load base model in 4-bit for 12 GB VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)

# Generate
prompt = """You are a medical AI assistant. Answer the following question accurately.

Question: What are the first-line treatments for acute uncomplicated cystitis in non-pregnant women?

Answer:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
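
Because the training samples were rendered with the Gemma chat template, prompting through `tokenizer.apply_chat_template` may match the training format more closely than a raw string. The sketch below assumes `model` and `tokenizer` are already loaded as in the example above and that the tokenizer ships a chat template; the helper names `build_messages` and `generate_answer` are hypothetical.

```python
from typing import Dict, List


def build_messages(question: str) -> List[Dict[str, str]]:
    """Wrap a question in a single user turn."""
    return [{"role": "user", "content": question}]


def generate_answer(model, tokenizer, question: str, max_new_tokens: int = 256) -> str:
    """Generate a reply using the tokenizer's chat template.

    `model` and `tokenizer` are the objects loaded in the quick-inference example.
    """
    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,  # append the assistant turn marker
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(build_messages("What are the red-flag symptoms of meningitis?"))
```

Calling `generate_answer(model, tokenizer, "...")` then returns only the model's reply rather than the prompt plus the reply.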

Merged Inference (BF16, requires ~10 GB VRAM)

If you have loaded and merged the adapter into the base model, you can push the merged weights separately and load them directly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MERGED_REPO = "fulvian/gemma-4-e2b-medical-qlora-merged"  # if uploaded

tokenizer = AutoTokenizer.from_pretrained(MERGED_REPO)
model = AutoModelForCausalLM.from_pretrained(
    MERGED_REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
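
A merged checkpoint like the one loaded above can be produced with PEFT's `merge_and_unload`. The following is a minimal sketch, not the author's export script: the function name `merge_adapter` and the output directory are hypothetical, and the base model is loaded in BF16 rather than 4-bit on the assumption that merging into unquantized weights avoids extra rounding error.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Fold LoRA weights into the base model and save a standalone checkpoint."""
    # Imported here so that defining the function needs no GPU libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_id)
    merged = model.merge_and_unload()  # base model with the LoRA update folded in

    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


# Example call (downloads the base model, so it is left commented out):
# merge_adapter(BASE_MODEL, ADAPTER_REPO, "gemma-4-e2b-medical-merged")
```

The saved directory can then be loaded with `AutoModelForCausalLM.from_pretrained` without PEFT installed, or pushed to the Hub as a separate repository.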

Hardware Requirements

Mode	Min VRAM	Recommended GPU
Adapter inference (4-bit base + adapter)	~6 GB	RTX 3060 12 GB
Adapter inference (BF16 base + adapter)	~10 GB	RTX 3060 12 GB
Merged model inference (BF16)	~10 GB	RTX 3060 12 GB
Training (QLoRA)	~11.4 GB	RTX 3060 12 GB

Environmental Impact

Item	Estimate
Hardware	1 × NVIDIA RTX 3060 12 GB
Training duration	~15 hours
Power consumption	~170 W TDP
Estimated CO₂	~2.5 kg CO₂eq (EU grid avg)

Carbon emissions estimated using the ML Impact calculator (Lacoste et al., 2019).
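
The energy behind this estimate can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the GPU drew its full ~170 W TDP for the whole ~15-hour run and ignores CPU, cooling, and other overhead; it then derives the carbon intensity implied by the reported ~2.5 kg figure. The intensity actually used by the calculator is not stated in this card.

```python
POWER_WATTS = 170        # GPU TDP from the table (upper bound on draw)
DURATION_HOURS = 15      # approximate training time
REPORTED_CO2_KG = 2.5    # estimate reported in the table

energy_kwh = POWER_WATTS * DURATION_HOURS / 1000
implied_intensity = REPORTED_CO2_KG / energy_kwh  # kg CO2eq per kWh

print(f"Energy: {energy_kwh:.2f} kWh")
print(f"Implied intensity: {implied_intensity:.2f} kg CO2eq/kWh")
```

Swapping in a measured average power draw or a region-specific grid intensity would give a tighter estimate than this TDP-based upper bound.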

Citation

If you use this adapter, please cite both the original Gemma model and this fine-tuning work:

@misc{gemma4e2b_medical_qlora,
  author       = {Fulvio},
  title        = {QLoRA Medical Adapter for Gemma 4 E2B},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/fulvian/gemma-4-e2b-medical-qlora-adapter}},
}

@misc{gemma2024,
  title        = {Gemma: Open Models Based on Gemini Research and Technology},
  author       = {Gemma Team},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/google/gemma-4-E2B-it}},
}

Model Card Authors

Fulvio

Model Card Contact

For questions or issues, please open an issue on the Hugging Face repository.

Built with 🩺 and QLoRA on a single RTX 3060 12 GB.
