Instructions for using Xenova/llama2.c-stories15M with libraries and local apps.

Libraries

Transformers.js

How to use Xenova/llama2.c-stories15M with Transformers.js:

// npm i @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Create a text-generation pipeline
const pipe = await pipeline('text-generation', 'Xenova/llama2.c-stories15M');

Transformers

How to use Xenova/llama2.c-stories15M with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Xenova/llama2.c-stories15M")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Xenova/llama2.c-stories15M")
model = AutoModelForCausalLM.from_pretrained("Xenova/llama2.c-stories15M")
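
Loading the tokenizer and model directly does not generate any text by itself. The following is a minimal sketch of the remaining steps (tokenize, generate, decode), assuming the PyTorch backend is installed. The helper names `continue_text` and `main` and the default of 50 new tokens are illustrative choices, not part of the model card.

```python
MODEL_ID = "Xenova/llama2.c-stories15M"


def continue_text(tokenizer, model, prompt, max_new_tokens=50):
    """Tokenize the prompt, generate new tokens, and decode the full sequence."""
    # return_tensors="pt" asks the tokenizer for PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; index 0 is the single sequence
    # (prompt tokens followed by the newly generated tokens).
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def main():
    # Imported here so the helper above stays importable without the library.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    print(continue_text(tokenizer, model, "Once upon a time,"))
```

Calling `main()` downloads the checkpoint on first use and prints the prompt together with its continuation.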

Local Apps

vLLM

How to use Xenova/llama2.c-stories15M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Xenova/llama2.c-stories15M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xenova/llama2.c-stories15M",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
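
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming a server is already running at the address used in the curl call. The function names are hypothetical, and the response parsing assumes the OpenAI-style completions shape in which the text sits under `choices[0].text`.

```python
import json
import urllib.request

MODEL_ID = "Xenova/llama2.c-stories15M"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the same POST request as the curl example (nothing is sent yet)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Return the first completion's text from an OpenAI-style response dict."""
    return response_body["choices"][0]["text"]


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return extract_text(json.load(response))
```

For example, `complete("Once upon a time,")` targets the vLLM server above, while passing `base_url="http://localhost:30000"` targets a server listening on the port used in the SGLang section.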

Use Docker

docker model run hf.co/Xenova/llama2.c-stories15M

SGLang

How to use Xenova/llama2.c-stories15M with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Xenova/llama2.c-stories15M" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xenova/llama2.c-stories15M",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Xenova/llama2.c-stories15M" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xenova/llama2.c-stories15M",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Xenova/llama2.c-stories15M with Docker Model Runner:
```
docker model run hf.co/Xenova/llama2.c-stories15M
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Usage (Transformers.js)

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @huggingface/transformers

You can then use the model to generate text like this:

import { pipeline } from "@huggingface/transformers";

// Create a text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/llama2.c-stories15M');

const text = 'Once upon a time,';
const output = await generator(text);
console.log(output);
// [{ generated_text: "Once upon a time, there was a little girl named Lily. She loved to play outside in" }]

const output2 = await generator(text, { max_new_tokens: 50 });
console.log(output2);
// [{ generated_text: "Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, dark cloud in the sky. She knew it was going to rain soon.\nLily ran inside her house" }]

Safetensors

Model size: 15.2M params
Tensor type: F32

Model tree for Xenova/llama2.c-stories15M

Quantizations: 2 models