Instructions for using mlabonne/NeuralMarcoro14-7B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use mlabonne/NeuralMarcoro14-7B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mlabonne/NeuralMarcoro14-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mlabonne/NeuralMarcoro14-7B")
model = AutoModelForCausalLM.from_pretrained("mlabonne/NeuralMarcoro14-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Apply the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (skip the prompt and special tokens)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```


vLLM

How to use mlabonne/NeuralMarcoro14-7B with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server (this command blocks while the server runs):
vllm serve "mlabonne/NeuralMarcoro14-7B"
# From a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mlabonne/NeuralMarcoro14-7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
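The curl call above can also be made from Python. The sketch below uses only the standard library (`urllib.request`, `json`) and assumes the vLLM server is running on `localhost:8000`; for the SGLang server described below, pass `base_url="http://localhost:30000"` instead. The helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical, and the parsing assumes the standard OpenAI chat-completions response shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Assumed defaults: vLLM listens on port 8000; the SGLang commands use 30000.
BASE_URL = "http://localhost:8000"
MODEL = "mlabonne/NeuralMarcoro14-7B"


# Hypothetical helpers (not part of vLLM or SGLang).
def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = BASE_URL, model: str = MODEL) -> str:
    """Send one user message and return the reply (requires a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

With a server running, `chat("What is the capital of France?")` sends the same request as the curl example and returns the assistant's reply as a string.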


SGLang

How to use mlabonne/NeuralMarcoro14-7B with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server (this command blocks while the server runs):
python3 -m sglang.launch_server \
    --model-path "mlabonne/NeuralMarcoro14-7B" \
    --host 0.0.0.0 \
    --port 30000
# From a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mlabonne/NeuralMarcoro14-7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mlabonne/NeuralMarcoro14-7B" \
        --host 0.0.0.0 \
        --port 30000
# From a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "mlabonne/NeuralMarcoro14-7B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use mlabonne/NeuralMarcoro14-7B with Docker Model Runner:
```
docker model run hf.co/mlabonne/NeuralMarcoro14-7B
```

Congrats!

by CultriX - opened Jan 8, 2024

Discussion

CultriX

Jan 8, 2024

Great result, congrats!
Although I can't help but feel you used my methods here... (lol, the joke had to be made, I'm sorry)

Thanks for sharing!

mlabonne

Owner Jan 8, 2024

Hahaha thanks @CultriX ! :)

CultriX

Jan 8, 2024 (edited Jan 8, 2024)

I do wonder, though: yours (while performing well overall, let there be no doubt about that) seems to show the steepest increase in performance on the GSM8K benchmark.

And, as somebody rightly pointed out on my model page, the Intel neural chat data includes GSM8K, which is also part of the leaderboard test.

As you know, I'm really new to all of this, so I'm not quite sure how big a difference this would make and how much it would influence:

1. The benchmarking results

(and more importantly:)

2. How that would translate to actual model performance versus the performance expected from the benchmarking results.

Could you chime in on that?
Would it make a substantial difference, either to the results or to how they relate to actual model performance in a real-world scenario?

mlabonne

Owner Jan 8, 2024

Isn't it data from the training split of GSM8K? I don't think the neural chat data is contaminated (but I might be wrong). If it's really test data, it makes the dataset absolutely useless :(

I don't completely rely on the Open LLM Leaderboard; I use another benchmark suite (with https://github.com/mlabonne/llm-autoeval) for this purpose. It doesn't include GSM8K.
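A rough way to test the contamination concern discussed in this thread is to look for long word n-grams shared between the fine-tuning data and the GSM8K test questions. The sketch below is an illustrative assumption, not the method used by the model author or by the Open LLM Leaderboard: the helper names `ngrams` and `flag_contaminated` are hypothetical, and the 8-gram window is an arbitrary choice. In practice, `train_texts` would hold the fine-tuning prompts and `test_texts` the GSM8K test-split questions; overlap with the GSM8K *training* split alone would not count as contamination.

```python
import re
from typing import Iterable


# Hypothetical helpers: a minimal n-gram overlap check, not a standard tool.
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, split it into word tokens, and return its word n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(
    train_texts: Iterable[str], test_texts: Iterable[str], n: int = 8
) -> list[int]:
    """Return indices of test items sharing at least one n-gram with the training texts."""
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, text in enumerate(test_texts) if ngrams(text, n) & train_grams]
```

A shorter window flags more items (including coincidental phrasing), while a longer one only catches near-verbatim copies, so flagged items still need a manual look.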
