
Cannot seem to use a GGUF file with the response_format setting.

by svjack - opened May 25, 2024

Discussion

svjack

May 25, 2024

edited Jun 4, 2024

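import llama_cpp

# Download the quantized GGUF weights from the Hub and load them.
llm = llama_cpp.Llama.from_pretrained(
    repo_id="bartowski/aya-23-8B-GGUF",
    filename="aya-23-8B-Q4_K_M.gguf",
    # tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("CohereForAI/aya-23-8B"),
    verbose=False,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=3060 * 3,  # context window in tokens
)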

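# The first line of the prompt is Chinese for: "Translate the JSON content below
# into Chinese and keep the corresponding JSON format:"
prompt = '''
将下面的json内容翻译成中文，并保留相应的json格式: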
    {'problem_description': "Two space agencies, Galactic Explorations and Interstellar Missions, are discussing the potential of Planet X-31 for human colonization. Galactic Explorations claims that Planet X-31 is an ideal candidate due to its Earth-like atmosphere and abundant water resources. Interstellar Missions, however, argues that Planet X-31 is not suitable for colonization because of its high levels of radiation, which they claim would make it impossible for humans to survive there. Galactic Explorations counters this argument by stating that humans could develop technology to shield themselves from radiation in the future. Which statement best describes the fallacy in Galactic Explorations' argument?", 'additional_problem_info': "A) The fallacy is that Galactic Explorations assumes humans can develop technology to shield themselves from radiation without any evidence. \nB) The fallacy is that Interstellar Missions is incorrect about the high levels of radiation on Planet X-31. \nC) The fallacy is that Galactic Explorations believes Planet X-31 is the only planet suitable for human colonization. \nD) The fallacy is that Interstellar Missions doesn't believe in the potential of human technological advancements.", 'chain_of_thought': "Galactic Explorations' argument assumes that humans will be able to develop technology to shield themselves from radiation in the future. However, there is no evidence presented in the problem description to support this claim. Therefore, their argument contains a fallacy.", 'correct_solution': 'A) The fallacy is that Galactic Explorations assumes humans can develop technology to shield themselves from radiation without any evidence.'}
'''

from IPython.display import clear_output
messages = [
    {
        "role": "user",
        "content": prompt
    }
]
response = llm.create_chat_completion(
    messages=messages,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "problem_description": {"type": "string"},
                "additional_problem_info": {"type": "string"},
                "chain_of_thought": {"type": "string"},
                "correct_solution": {"type": "string"},
            },
            "required": ["problem_description", "additional_problem_info", "chain_of_thought", "correct_solution"],
        },
    },
    stream=True,
)

req = ""
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    #print(delta["content"], end="", flush=True)
    req += delta["content"]
    clear_output(wait = True)
    print(req)

When I run this, the Python kernel dies.
Can someone help me? 😊
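
For anyone trying to narrow this down, here is a minimal sketch that separates the streaming loop from the model call, so the collection and JSON checks can be exercised without loading the GGUF file. It assumes the chunks follow the same OpenAI-style shape used in the loop above (`chunk["choices"][0]["delta"]`). The helper names `collect_stream` and `parse_and_check`, and the `fake_chunks` data, are hypothetical and only for illustration; they are not part of llama-cpp-python.

```python
import json

# Keys listed under "required" in the response_format schema above.
REQUIRED_KEYS = [
    "problem_description",
    "additional_problem_info",
    "chain_of_thought",
    "correct_solution",
]


def collect_stream(chunks):
    """Concatenate the "content" pieces of OpenAI-style streaming chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # Some chunks carry only a role or an empty delta; skip those.
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)


def parse_and_check(text, required=REQUIRED_KEYS):
    """Parse text as JSON and return (data, list_of_missing_required_keys)."""
    data = json.loads(text)
    missing = [key for key in required if key not in data]
    return data, missing


# Hypothetical chunks standing in for the stream returned by the model call.
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": '{"problem_description": "a", '}}]},
    {"choices": [{"delta": {"content": '"additional_problem_info": "b", '}}]},
    {"choices": [{"delta": {"content": '"chain_of_thought": "c", '}}]},
    {"choices": [{"delta": {"content": '"correct_solution": "d"}'}}]},
    {"choices": [{"delta": {}}]},
]

text = collect_stream(fake_chunks)
data, missing = parse_and_check(text)
print("missing keys:", missing)
```

If the same helpers work on fake chunks but the kernel still dies when they are fed the real `response`, that would suggest the crash happens inside the model call rather than in the loop or the IPython display code.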

svjack changed discussion status to closed May 25, 2024

svjack changed discussion status to open May 25, 2024

m0javad

May 29, 2024

You cannot do that with llama.cpp; you should do it with ONNX instead.
It is based on a T5 transformer.

alexrs changed discussion status to closed Jun 12, 2025
