8-bit precision error

#32

by saireddy - opened Feb 22, 2024

Discussion

saireddy

Feb 22, 2024

Is anyone else facing this issue on multiple A100 GPUs?
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I am using "auto" for the device map and am still hitting this issue.

erfanzar

Feb 22, 2024

Can you provide the training code?

saireddy

Feb 22, 2024

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-7b"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

print("initiating model download")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token,
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",
    disable_tqdm=False,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,
)

from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # applies generate_prompt to every sample in the training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

A few package versions that might help:
accelerate==0.27.2
transformers==4.38.1
trl==0.7.11
bitsandbytes==0.42.0

ybelkada

Feb 23, 2024

Hi @saireddy !

This happens because each training process needs the whole model on its own GPU, rather than having it split across all available devices. Can you try:

+ from accelerate import PartialState

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
-    device_map="auto",
+    device_map={"": PartialState().process_index},
    token=access_token
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy='epoch',
    disable_tqdm=False,
    load_best_model_at_end=True,    
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,
)
from trl import SFTTrainer

max_seq_length = 2048 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # applies generate_prompt to every sample in the training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"]
)
trainer.train()

The line device_map={"": PartialState().process_index} makes accelerate place the whole model on the GPU whose index matches the current process, instead of splitting it across multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994
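
A minimal sketch of the idea behind that line, for readers who want to see it without accelerate installed: each training process works out its own GPU index and maps the whole model (the empty-string key) to it. The sketch assumes a single node and a launcher (torchrun or accelerate launch) that sets the LOCAL_RANK environment variable. The helper name per_process_device_map is hypothetical; the suggestion above uses PartialState().process_index rather than reading the environment directly.

```python
import os


def per_process_device_map(env=None):
    """Return a device_map that places the whole model on this process's GPU.

    Assumption: the launcher exports LOCAL_RANK (0, 1, 2, ...) for each
    process on the node. Falls back to GPU 0 when the variable is absent,
    for example in a plain single-process run.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    # The empty-string key means "the entire model"; the value is the GPU index.
    return {"": local_rank}


if __name__ == "__main__":
    # Simulate the four processes of a 4-GPU launch.
    for rank in range(4):
        print(rank, per_process_device_map({"LOCAL_RANK": str(rank)}))
```

The returned dictionary would be passed as device_map=... to from_pretrained. With this layout every process holds a full copy of the model plus its own batch of activations, so per-GPU memory use may be higher than when the layers are split across devices.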

saireddy

Feb 23, 2024

Thanks @ybelkada. A couple of follow-up questions:
I am using the same template for other models such as Llama 2 and it doesn't produce this error. Also, I am using 4-bit quantization, but the error talks about 8-bit precision. Am I missing something here? Could you please share your thoughts on this?

I tried the changes you mentioned and, surprisingly, I am hitting an out-of-memory error even with an 80GB A100.

tvatsa

Mar 14, 2024

Same here. I tried the suggestion, and it runs out of memory even with 4 A10 GPUs.

alokkrsahu

Mar 25, 2024

I am facing the same issue.

CODE and ERROR BELOW:

trainer.train()

  File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 331, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
    result = tuple(
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1331, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

alokkrsahu

Mar 25, 2024

device_map={"": PartialState().process_index}

The change above gives a CUDA OOM error. I am working with Gemma2b-it.

deleted

Mar 30, 2024

I also tried the suggestion and I am having the same problem with the Mistral 7B model, which I am trying to fine-tune.

My code is:

import functools

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    TrainingArguments,
)

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,  # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4',  # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True,  # quantize quantized weights //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16  # optimized fp format for ML
)

lora_config = LoraConfig(
    r = 16,  # the dimension of the low-rank matrices
    lora_alpha = 8,  # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05,  # dropout probability of the LoRA layers
    bias = 'none',  # whether to train bias weights, set to 'none' for attention layers
    task_type = 'SEQ_CLS'
)

from accelerate import PartialState

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=len(le.classes_),
    device_map={"": PartialState().process_index},
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.pad_token_id = tokenizer.pad_token_id

# define training args
training_args = TrainingArguments(
    output_dir = 'multilabel_classification',
    learning_rate = 1e-4,
    per_device_train_batch_size = 8,  # tested with 16gb gpu ram
    per_device_eval_batch_size = 8,
    num_train_epochs = 10,
    weight_decay = 0.01,
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    load_best_model_at_end = True
)

label_weights = 1 - labels_encoded/labels_encoded.sum()

# train
trainer = CustomTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_ds['train'],
    eval_dataset = tokenized_ds['val'],
    tokenizer = tokenizer,
    data_collator = functools.partial(collate_fn, tokenizer=tokenizer),
    compute_metrics = compute_metrics,
    label_weights = torch.tensor(label_weights, device=model.device)
)

trainer.train()

deleted

Mar 30, 2024

BTW, I am getting the original error mentioned above: ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I0000 00:00:1711804411.121734 181893 cpu_client.cc:373] TfrtCpuClient destroyed.

saireddy

Apr 12, 2024

I tried with the new Gemma 7B IT model and am still hitting the same issue.

ManojShack

Apr 25, 2024

edited Apr 25, 2024

Hi everyone,

I encountered a similar error with llama3 - 7b. To address this issue, I tried the following solution:

the line device_map={"": PartialState().process_index} will make sure accelerate will force-set the model into the GPU of index i instead of splitting into multiple devices.
See the reasons behind it in this GitHub issue comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994

While this helped resolve the initial error, it has now led to an Out of Memory (OOM) issue. I find this situation somewhat unexpected, considering I am using 8 Nvidia A100 GPUs (each with 40GB of memory) and have never experienced OOM errors with this configuration when working with models of similar size. I am currently performing QLoRA during the fine-tuning process.

Following are the versions of libraries that I am using:
transformers==4.40.1
accelerate==0.30.0.dev0
trl==0.8.6
peft==0.10.0

I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.

Here are my LoRA params:

Rank = 8
Alpha = 8
Using 4bit quantization while loading the base model.

Has anyone figured out a solution to this problem? Thanks in advance!

ychenNLP

Apr 30, 2024

I got the same error and am hitting OOM after the change.
My training code works for Llama 2 but not for Llama 3 on 4x A40.
Have you figured out a solution?

rupakdas18

May 21, 2024

edited May 22, 2024

This is how I solved my issue:
Before running my script, I ran the command below. In my case, I wanted to use GPU 3, 4, or 5 (the other GPUs were heavily loaded by other users).

export CUDA_VISIBLE_DEVICES=3,4,5

Inside my Python script, I used these commands:

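import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous; set before CUDA is initialized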

desired_device = 0
torch.cuda.set_device(desired_device)
............

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
    device_map={"": desired_device},  # load the whole model onto the selected GPU
)

....................

My guess is that the system now treats GPU 3 as GPU 0 (the default GPU) because of the "export CUDA_VISIBLE_DEVICES=3,4,5" command. After the export, I listed the available GPUs and the system gave me this output:

Number of available GPUs: 3
GPU 0: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 1: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
GPU 2: NVIDIA A100-SXM4-80GB
Compute Capability: (8, 0)
Memory: 79.20 GB
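
The renumbering described above can be illustrated with a small sketch. CUDA numbers the devices listed in CUDA_VISIBLE_DEVICES from 0 in the order given, so index 0 inside the process corresponds to the first entry of the list. The helper name logical_to_physical is hypothetical and only parses the string; it does not query any GPU.

```python
def logical_to_physical(cuda_visible_devices):
    """Map the indices a process sees to the entries of CUDA_VISIBLE_DEVICES.

    Entries are kept as strings because the variable may also hold GPU UUIDs.
    """
    entries = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(entries)}


mapping = logical_to_physical("3,4,5")
print(mapping)  # {0: '3', 1: '4', 2: '5'}
```

Under this mapping, torch.cuda.set_device(0) in the script selects physical GPU 3, which matches the three-GPU listing shown above.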

Renu11

Google org Jun 27, 2024

Hi @saireddy , Could you please confirm if this issue is resolved or you are still facing the same issue? Thank you.

saireddy

Jul 2, 2024

The one above worked for me, thanks @Renu11.

saireddy changed discussion status to closed Jul 2, 2024

ctsang

Jul 3, 2024

@saireddy which of the solutions above worked for you? I had a similar error, but none of the suggestions above worked. Thanks.

saireddy

Jul 10, 2024

@ctsang @Renu11 Reducing the batch size worked for me. Here are the versions I used:
accelerate==0.31.0
bitsandbytes==0.43.1
datasets==2.18.0
deepspeed==0.14.4
evaluate==0.4.1
peft==0.11.1
transformers==4.42.3
trl==0.9.4
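
As a rough illustration of why a smaller batch size can help, here is a sketch of the arithmetic using the settings from the training code earlier in the thread (per_device_train_batch_size=8, gradient_accumulation_steps=2, max_seq_length=2048 with packing). The reduced values and the process count of 4 are hypothetical, since the post above does not say which values were used; the function names are also made up for this example.

```python
def tokens_per_device_step(per_device_batch_size, max_seq_length):
    """Tokens one GPU processes in a single forward/backward pass.

    With packing, every sequence is filled up to max_seq_length.
    """
    return per_device_batch_size * max_seq_length


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_processes):
    """Sequences contributing to one optimizer update across all processes."""
    return per_device_batch_size * grad_accum_steps * num_processes


# Settings from the thread.
print(tokens_per_device_step(8, 2048))   # 16384
print(effective_batch_size(8, 2, 4))     # 64

# Hypothetical reduced settings: smaller per-device batch, more accumulation.
print(tokens_per_device_step(2, 2048))   # 4096
print(effective_batch_size(2, 8, 4))     # 64
```

Lowering the per-device batch size reduces the activations held on each GPU at once, while raising gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.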
