8-bit precision error
Is anyone else facing this issue on a multi-GPU A100 machine?
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I am using device_map="auto" and am still hitting this issue.
can you provide the training code?
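import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_id = "google/gemma-7b"
# BitsAndBytesConfig int-4 config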
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
print("initiating model download")
model = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=bnb_config,
use_cache=False,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
device_map="auto", token=access_token)
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj"],
r=64,
bias="none",
task_type="CAUSAL_LM",
)
# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
from transformers import TrainingArguments
args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=15,
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
optim="paged_adamw_32bit",
logging_steps=100,
save_strategy="epoch",
learning_rate=2e-4,
bf16=True,
tf32=True,
max_grad_norm=0.3,
warmup_ratio=0.03,
seed=42,
eval_steps=100,
lr_scheduler_type="cosine",
evaluation_strategy='epoch',
disable_tqdm=False,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
report_to="wandb",
run_name=run_name, # note: with packing, the tqdm progress values are incorrect
)
from trl import SFTTrainer
max_seq_length = 2048 # max sequence length for model and packing of the dataset
trainer = SFTTrainer(
model=model,
peft_config=peft_config,
max_seq_length=max_seq_length,
tokenizer=tokenizer,
packing=True,
formatting_func=generate_prompt, # applies generate_prompt to every example in the training and test datasets
args=args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"]
)
trainer.train()
A few package versions that might help:
accelerate==0.27.2
transformers==4.38.1
trl==0.7.11
bitsandbytes==0.42.0
Hi @saireddy !
This happens because, in a multi-GPU run, each process needs the whole model on its own GPU instead of having it split across all available devices. Can you try:
+ from accelerate import PartialState
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
- device_map="auto",
+ device_map={"": PartialState().process_index},
token=access_token
)
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj"],
r=64,
bias="none",
task_type="CAUSAL_LM",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
from transformers import TrainingArguments
args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=15,
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
gradient_checkpointing=True,
optim="paged_adamw_32bit",
logging_steps=100,
save_strategy="epoch",
learning_rate=2e-4,
bf16=True,
tf32=True,
max_grad_norm=0.3,
warmup_ratio=0.03,
seed=42,
eval_steps=100,
lr_scheduler_type="cosine",
evaluation_strategy='epoch',
disable_tqdm=False,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
report_to="wandb",
run_name=run_name, # note: with packing, the tqdm progress values are incorrect
)
from trl import SFTTrainer
max_seq_length = 2048 # max sequence length for model and packing of the dataset
trainer = SFTTrainer(
model=model,
peft_config=peft_config,
max_seq_length=max_seq_length,
tokenizer=tokenizer,
packing=True,
formatting_func=generate_prompt, # applies generate_prompt to every example in the training and test datasets
args=args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"]
)
trainer.train()
The line device_map={"": PartialState().process_index} makes sure the whole model is placed on the GPU whose index matches the current process, instead of being split across multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
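As a rough illustration of what that device map looks like, here is a minimal sketch that builds the same kind of mapping without accelerate. It assumes the script is started by a launcher that sets the `LOCAL_RANK` environment variable for each process (torchrun and accelerate launch both do this); the helper name `single_gpu_device_map` is made up for this example and is not part of any library.

```python
import os


def single_gpu_device_map(env=None):
    """Return a device_map that puts the whole model on this process's GPU.

    The empty-string key "" stands for the root module, i.e. every weight.
    """
    env = os.environ if env is None else env
    # LOCAL_RANK is assumed to be set by the launcher; fall back to GPU 0
    # for a plain single-process run.
    local_rank = int(env.get("LOCAL_RANK", 0))
    return {"": local_rank}


if __name__ == "__main__":
    print(single_gpu_device_map())
```

The result can be passed as `device_map=` to `from_pretrained`. Note that `PartialState().process_index` is the global rank; on a single node it equals the local rank, but on several nodes `PartialState().local_process_index` is the value that lines up with the GPU index.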
Thanks @ybelkada. A couple of follow-up questions:
I am using the same template for other models such as llama2 and it doesn't show this error. Also, I am using 4-bit quantization, but the error talks about 8-bit precision. Am I missing something here? Can you please share your thoughts on this?
I tried the changes you mentioned and, surprisingly, I am now hitting an out-of-memory error even on an 80GB A100.
Same here, I tried the suggestion. It's running out of memory even with 4 A10 GPUs.
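One possible contributor to the OOM, which nobody in this thread has confirmed: Gemma's vocabulary is much larger than Llama 2's (256,000 versus 32,000 entries in their published configs), so the logits tensor for one batch is correspondingly larger. The sketch below is only back-of-the-envelope arithmetic using the batch size of 8 and sequence length of 2048 from the training code earlier in the thread. It assumes a single float32 copy of the logits and ignores weights, activations, and optimizer state; `logits_gib` is a hypothetical helper.

```python
def logits_gib(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Size in GiB of one (batch, seq_len, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 2**30


if __name__ == "__main__":
    # Vocabulary sizes taken from the models' published configs.
    for name, vocab in [("gemma", 256_000), ("llama2", 32_000)]:
        print(f"{name}: {logits_gib(8, 2048, vocab):.2f} GiB")
```

Under these assumptions the Gemma logits alone are eight times the size of the Llama 2 ones for the same batch. Lowering `per_device_train_batch_size` or `max_seq_length` shrinks this term proportionally, so that may be worth trying once the whole model sits on one GPU. Llama 3 also has a larger vocabulary (128,256 entries), which would be consistent with the later reports here, though that too is a guess.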
I am facing the same issue.
CODE and ERROR BELOW:
trainer.train()
File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1331, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
- device_map={"": PartialState().process_index}
The change above gives a CUDA OOM error. I am working with Gemma2b-it.
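To tell the two failure modes in this thread apart (model split across GPUs versus a single GPU running out of memory), it can help to look at where the weights actually ended up. When a model is loaded with a `device_map`, transformers records the placement on the model as `model.hf_device_map`, a dict from module name to device. The sketch below only inspects such a dict; the helper names and the example maps are hypothetical.

```python
def devices_used(device_map):
    """Return the distinct devices a module-to-device map refers to."""
    return set(device_map.values())


def is_split(device_map):
    """True if the map spreads the model over more than one device."""
    return len(devices_used(device_map)) > 1


if __name__ == "__main__":
    # Hypothetical maps, shaped like what device_map="auto" and
    # device_map={"": 0} tend to produce.
    auto_like = {
        "model.embed_tokens": 0,
        "model.layers.0": 0,
        "model.layers.1": 1,
        "lm_head": 1,
    }
    single = {"": 0}
    print(is_split(auto_like), is_split(single))
```

With a real model this would be called as `is_split(model.hf_device_map)` right after `from_pretrained`. If it reports a split while training is launched with one process per GPU, the ValueError above is the expected result.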
I also tried the suggestion and I am having the same problem with the Mistral 7B model that I am trying to fine-tune.
My code is
# quantization config
quantization_config = BitsAndBytesConfig(
load_in_4bit = True, # enable 4-bit quantization
bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
bnb_4bit_use_double_quant = True, # quantize quantized weights //insert xzibit meme
bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)
lora_config = LoraConfig(
r = 16, # the dimension of the low-rank matrices
lora_alpha = 8, # scaling factor for LoRA activations vs pre-trained weight activations
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
lora_dropout = 0.05, # dropout probability of the LoRA layers
bias = 'none', # whether to train bias weights, set to 'none' for attention layers
task_type = 'SEQ_CLS'
)
from accelerate import PartialState
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
quantization_config=quantization_config,
num_labels=len(le.classes_),
device_map={"": PartialState().process_index},
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.pad_token_id = tokenizer.pad_token_id
# define training args
training_args = TrainingArguments(
output_dir = 'multilabel_classification',
learning_rate = 1e-4,
per_device_train_batch_size = 8, # tested with 16gb gpu ram
per_device_eval_batch_size = 8,
num_train_epochs = 10,
weight_decay = 0.01,
evaluation_strategy = 'epoch',
save_strategy = 'epoch',
load_best_model_at_end = True
)
label_weights = 1 - labels_encoded/labels_encoded.sum()
# train
trainer = CustomTrainer(
model = model,
args = training_args,
train_dataset = tokenized_ds['train'],
eval_dataset = tokenized_ds['val'],
tokenizer = tokenizer,
data_collator = functools.partial(collate_fn, tokenizer=tokenizer),
compute_metrics = compute_metrics,
label_weights = torch.tensor(label_weights, device=model.device)
)
trainer.train()
BTW I am getting the original error mentioned - ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I0000 00:00:1711804411.121734 181893 cpu_client.cc:373] TfrtCpuClient destroyed.
I tried with the new gemma-7b-it model and am still hitting the same issue.
Hi everyone,
I encountered a similar error with llama3 - 7b. To address this issue, I tried the following solution:
the line device_map={"": PartialState().process_index} will make sure accelerate will force-set the model into the GPU of index i instead of splitting into multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
While this helped resolve the initial error, it has now led to an Out of Memory (OOM) issue. I find this situation somewhat unexpected, considering I am using 8 Nvidia A100 GPUs (each with 40GB of memory) and have never experienced OOM errors with this configuration when working with models of similar size. I am currently performing QLoRA during the fine-tuning process.
Following are the versions of libraries that I am using:
transformers==4.40.1
accelerate==0.30.0.dev0
trl==0.8.6
peft==0.10.0
I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.
Here are my LoRA params:
- Rank = 8
- Alpha = 8
- Using 4bit quantization while loading the base model.
Has anyone figured out a solution to this problem? Thanks in advance!
Thanks @ybelkada . couple of follow up questions
i am using same template for other models like llama2 and it doesn't show same errors and also I am using 4-bit quantization but error talks about 8-bit precision. am I missing something here ? can you please share your thoughts on this?
i tried with the changes you have mentioned, surprisingly I am hitting out of memory issue even with 80GB A100 one.
I got the same error and hitting OOM after the change.
My training code works for llama2 but not llama3 on 4x A40.
Have you figured out a solution?
This is how I solved my issue:
Before running my script, I ran the command below. In my case, I wanted to use either GPU 3, 4, or 5 (other GPUs were highly loaded by other users)
export CUDA_VISIBLE_DEVICES=3,4,5
Inside my Python script, I used these commands,
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1" # must be set in the environment before CUDA is initialized
desired_device = 0 # index among the visible GPUs
torch.cuda.set_device(desired_device)
............
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=compute_dtype,
quantization_config=bnb_config,
device_map={"": desired_device}, # load the whole model on that one GPU
)
....................
My guess is that, because of the "export CUDA_VISIBLE_DEVICES=3,4,5" command, the system now treats GPU 3 as GPU 0 (the default GPU). After running the export command, I listed the available GPUs and the system gave me this output:
Number of available GPUs: 3
GPU 0: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 1: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 2: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
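That guess matches how CUDA handles `CUDA_VISIBLE_DEVICES`: the listed devices are renumbered from 0 in the order given, so a process only ever sees indices 0..n-1. A small sketch of that mapping follows. It assumes the variable holds plain integer IDs (CUDA also accepts GPU UUIDs, which this does not handle), and `logical_to_physical` is a hypothetical helper.

```python
def logical_to_physical(cuda_visible_devices):
    """Map the GPU indices a process sees to the physical GPU IDs.

    Takes the value of CUDA_VISIBLE_DEVICES as a string, e.g. "3,4,5".
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    # Visible devices are numbered from 0 in the order they are listed.
    return {logical: int(physical) for logical, physical in enumerate(ids)}


if __name__ == "__main__":
    print(logical_to_physical("3,4,5"))
```

So with `export CUDA_VISIBLE_DEVICES=3,4,5`, a `desired_device` of 0 inside the script refers to physical GPU 3.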