Discrepancy between HF and vLLM + bug
#19 opened by vince62s
With HF:
For the record, there is a missing `.to(x.device)` for `position_ids` in the RoPE forward in `modeling_ministral3.py`:
```python
@torch.no_grad()
@dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
def forward(self, x, position_ids):
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
    position_ids_expanded = position_ids[:, None, :].float().to(x.device)
```
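For context, a minimal sketch of the failure this cast avoids, assuming the forward goes on to matmul the two tensors as in other Mistral-family RoPE implementations (shapes below are made up for illustration):

```python
import torch

# Hypothetical shapes, only to illustrate the device mismatch
inv_freq_expanded = torch.randn(1, 64, 1, device="cuda")         # follows x onto the GPU
position_ids_expanded = torch.arange(16)[None, None, :].float()  # still on CPU without the fix

# Without .to(x.device) on position_ids, the follow-up matmul
#   freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
# fails with: RuntimeError: Expected all tensors to be on the same device, ...
freqs = (inv_freq_expanded @ position_ids_expanded.to("cuda")).transpose(1, 2)
print(freqs.shape)  # torch.Size([1, 16, 64])
```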
Running script (greedy)
```python
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is the meaning of life.",
            },
            # {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
# tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
# image_sizes = [tokenized["pixel_values"].shape[-2:]]

import time

start_time = time.time()
output = model.generate(
    **tokenized,
    # image_sizes=image_sizes,
    max_new_tokens=4096,
)[0]

n_toks = len(output[len(tokenized["input_ids"][0]):])
decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
print(n_toks / (time.time() - start_time))
```
Gives:
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'max_position_embeddings'}
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 48770.98it/s]
Download complete: : 0.00B [00:00, ?B/s] | 0/6 [00:00<?, ?it/s]
Loading weights: 100%|██████████| 585/585 [00:04<00:00, 129.22it/s, Materializing param=model.vision_tower.transformer.layers.23.ffn_norm.weight]
The tied weights mapping and config for this model specifies to tie model.language_model.embed_tokens.weight to lm_head.weight, but both are present in the checkpoints, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning
The question *"What is the meaning of life?"* is one of the most profound and debated topics in philosophy, religion, science, and personal reflection. There is no single, universally accepted answer, but here are some perspectives from different fields:
### **1. Philosophical Perspectives**
- **Existentialism (Sartre, Camus, Nietzsche):**
Life has no inherent meaning; it's up to each individual to create their own purpose through choices, actions, and passions. *"Existence precedes essence."* (Sartre)
- *Albert Camus* argued that we must embrace the absurdity of life and rebel against it by living with passion and authenticity.
- **Stoicism (Marcus Aurelius, Epictetus):**
Meaning comes from living virtuously, accepting what we cannot control, and focusing on reason, self-discipline, and contributing to the greater good.
- **Absurdism (Camus):**
Life is inherently meaningless, but we can find joy in the struggle itself, rather than seeking a grand purpose.
- **Nihilism (Nietzsche, later existentialists):**
If God is dead (in a metaphorical sense), traditional meanings collapse, and we must create our own values. Nietzsche famously said, *"He who has a why to live can bear almost any how."*
### **2. Religious & Spiritual Views**
- **Christianity/Judaism/Islam:**
Life's meaning is to serve, love, and connect with God, follow divine will, and prepare for an afterlife (heaven, nirvana, etc.).
- *"Love the Lord your God with all your heart, soul, and mind."* (Matthew 22:37)
- **Buddhism/Hinduism:**
Meaning comes from enlightenment (breaking free from suffering and the cycle of rebirth) through wisdom, compassion, and detachment from ego.
- *"The meaning of life is to end suffering."* (Buddha)
- **Taoism:**
Meaning is found in harmony with the *Tao* (the natural flow of the universe), living simply, and embracing spontaneity.
### **3. Scientific & Evolutionary Perspectives**
- **Biological Evolution (Darwin, Dawkins):**
From a purely scientific standpoint, life's "purpose" is survival, reproduction, and passing on genes. Richard Dawkins calls this the *"selfish gene"* perspective.
- However, humans also seek meaning beyond mere survival (art, love, knowledge, etc.).
- **Cosmology & Physics:**
The universe is vast, indifferent, and likely without inherent meaning, but that doesn't mean *our* lives lack significance. Some scientists (like Carl Sagan) argue that meaning comes from curiosity, exploration, and connection.
### **4. Personal & Subjective Meaning**
Many people find meaning in:
- **Relationships** (love, family, friendship)
- **Creativity** (art, music, writing, innovation)
- **Contribution** (helping others, activism, legacy)
- **Growth** (learning, self-improvement, overcoming challenges)
- **Experience** (joy, beauty, adventure, mindfulness)
### **5. Modern & Pop Culture Takes**
- **Viktor Frankl (Holocaust survivor, psychiatrist):**
*"Life is never made unbearable by circumstances, but only by lack of meaning and purpose."* (From *Manβs Search for Meaning*)
- Meaning comes from suffering with purpose, love, and work.
- **Sam Harris (Neuroscientist/Philosopher):**
Meaning is a *human construct*; we create it through values, goals, and connections, not through divine decree.
- **Pop Culture (e.g., *Rick and Morty*, *The Good Place*):**
Often plays with the idea that meaning is subjective, absurd, or even a cosmic joke, but that doesn't negate the importance of finding *your* own.
### **A Practical Approach to Finding Meaning**
If you're asking this question, you're already on the path. Here's how some people find meaning:
1. **Engage deeply** - Love, create, learn, or contribute to something bigger than yourself.
2. **Embrace suffering** - Frankl's idea: meaning often emerges from struggle.
3. **Connect with others** - Relationships give life depth.
4. **Explore curiosity** - Science, art, philosophy, or adventure can fulfill the need for wonder.
5. **Accept uncertainty** - Life may not have a predefined meaning, but that's okay; you get to define it.
### **Final Thought**
As the philosopher **Alan Watts** said:
*"The meaning of life is just to be alive. It is so plain and so obvious, and so incredibly obvious that we have to keep forgetting it."*
Or, as **Douglas Adams** humorously put it in *The Hitchhiker's Guide to the Galaxy*:
*"The answer to the ultimate question of life, the universe, and everything is… 42."* (But the real joke is that the question itself was never properly defined.)
**What do *you* think gives life meaning?** That's the real question.</s>
You will notice the `Unrecognized keys in rope_parameters for 'rope_type'='yarn': {'max_position_embeddings'}` warning.
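For reference, here is one quick way to check which RoPE keys the checkpoint actually ships (a sketch; the key names printed are whatever the repo's config.json contains, not assumptions on my side):

```python
import json
from huggingface_hub import hf_hub_download

# Download just the config and print the RoPE-related section
path = hf_hub_download("mistralai/Ministral-3-14B-Reasoning-2512", "config.json")
with open(path) as f:
    cfg = json.load(f)

text_cfg = cfg.get("text_config", cfg)
print(json.dumps(text_cfg.get("rope_parameters") or text_cfg.get("rope_scaling"), indent=2))
```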
When using vLLM:
```python
from vllm import LLM, SamplingParams


def main():
    # Initialize the model with a custom max_model_len
    llm = LLM(
        model="mistralai/Ministral-3-14B-Reasoning-2512",
        max_model_len=8192,  # <-- Set the total context length here (prompt + generated)
        gpu_memory_utilization=0.95,
    )

    # Define sampling parameters
    sampling_params = SamplingParams(
        temperature=0.0,
        top_p=1.0,
        max_tokens=4096,  # how many tokens to generate
    )

    # Your prompts
    prompts = [
        "[INST]What is the meaning of life.[/INST]",
    ]

    # Generate
    outputs = llm.generate(prompts, sampling_params)

    # Print results
    for output in outputs:
        print(f"Prompt: {output.prompt!r}")
        print(f"Generated: {output.outputs[0].text!r}")
        print("-" * 50)


if __name__ == '__main__':
    main()
```
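As a side note, to rule out prompt-formatting differences between the two runs, vLLM can also be given chat messages so that the model's own chat template is applied instead of the hand-written `[INST]...[/INST]` string (a sketch, not part of the original report):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Ministral-3-14B-Reasoning-2512", max_model_len=8192)
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=4096)

# llm.chat() formats the messages with the model's chat template before generation
messages = [{"role": "user", "content": "What is the meaning of life."}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```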
gives this:
INFO 12-30 14:25:03 [utils.py:253] non-default args: {'max_model_len': 8192, 'gpu_memory_utilization': 0.95, 'disable_log_stats': True, 'model': 'mistralai/Ministral-3-14B-Reasoning-2512'}
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'apply_yarn_scaling'}
`rope_parameters`'s factor field must be a float >= 1, got 16
`rope_parameters`'s beta_fast field must be a float, got 32
`rope_parameters`'s beta_slow field must be a float, got 1
INFO 12-30 14:25:04 [model.py:514] Resolved architecture: PixtralForConditionalGeneration
INFO 12-30 14:25:04 [model.py:1661] Using max model len 8192
INFO 12-30 14:25:04 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
[2025-12-30 14:25:04] INFO tekken.py:195: Non special vocabulary size is 130072 with 1000 special tokens.
WARNING 12-30 14:25:06 [system_utils.py:136] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:11 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='mistralai/Ministral-3-14B-Reasoning-2512', speculative_config=None, tokenizer='mistralai/Ministral-3-14B-Reasoning-2512', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=mistralai/Ministral-3-14B-Reasoning-2512, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=2055856) [2025-12-30 14:25:12] INFO tekken.py:195: Non special vocabulary size is 130072 with 1000 special tokens.
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:13 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.19:54823 backend=nccl
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:13 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:13 [gpu_model_runner.py:3562] Starting to load model mistralai/Ministral-3-14B-Reasoning-2512...
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:14 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:14 [weight_utils.py:527] No consolidated.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.71s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.71s/it]
(EngineCore_DP0 pid=2055856)
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:20 [default_loader.py:308] Loading weights took 5.62 seconds
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:21 [gpu_model_runner.py:3659] Model loading took 26.0381 GiB memory and 6.380093 seconds
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:21 [gpu_model_runner.py:4446] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 image items of the maximum feature size.
(EngineCore_DP0 pid=2055856) WARNING 12-30 14:25:21 [processing.py:1153] PixtralProcessorAdapter did not return `BatchFeature`. Make sure to match the behaviour of `ProcessorMixin` when implementing custom processors.
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:26 [backends.py:643] Using cache directory: /home/vincent/.cache/vllm/torch_compile_cache/bdf311ff19/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:26 [backends.py:703] Dynamo bytecode transform time: 4.22 s
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:32 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.020 s
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:32 [monitor.py:34] torch.compile takes 5.24 s in total
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:33 [gpu_worker.py:375] Available KV cache memory: 1.70 GiB
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:33 [kv_cache_utils.py:1291] GPU KV cache size: 11,104 tokens
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:33 [kv_cache_utils.py:1296] Maximum concurrency for 8,192 tokens per request: 1.36x
(EngineCore_DP0 pid=2055856) 2025-12-30 14:25:33,860 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=2055856) 2025-12-30 14:25:33,874 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:04<00:00, 12.21it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 18.59it/s]
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:40 [gpu_model_runner.py:4587] Graph capturing finished in 7 secs, took -1.01 GiB
(EngineCore_DP0 pid=2055856) INFO 12-30 14:25:40 [core.py:259] init engine (profile, create kv cache, warmup model) took 19.51 seconds
INFO 12-30 14:25:41 [llm.py:360] Supported tasks: ['generate']
Adding requests: 0%| | 0/1 [00:00<?, ?it/s]/home/vincent/miniconda3/envs/pt2.8/lib/python3.11/site-packages/mistral_common/tokens/tokenizers/tekken.py:462: FutureWarning: `get_control_token` is deprecated. Use `get_special_token` instead.
warnings.warn("`get_control_token` is deprecated. Use `get_special_token` instead.", FutureWarning)
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1572.08it/s]
Processed prompts: 100%|██████████| 1/1 [00:05<00:00, 5.85s/it, est. speed input: 2.74 toks/s, output: 57.13 toks/s]
Prompt: '[INST]What is the meaning of life.[/INST]'
Generated: " The meaning of life is a profound and complex question that has been explored by philosophers, scientists, and thinkers throughout history. It encompasses various perspectives, including existential, spiritual, and scientific. Here are a few key viewpoints:\n\n1. **Existential Perspective**: Many existentialist philosophers, such as Jean-Paul Sartre and Albert Camus, suggest that life has no inherent meaning, and it is up to each individual to create their own purpose. They emphasize the importance of personal choice, responsibility, and authenticity in shaping one's life.\n\n2. **Spiritual Perspective**: Many spiritual traditions, such as Buddhism, Christianity, and Hinduism, propose that the meaning of life is tied to spiritual growth, enlightenment, or union with a higher power. These traditions often emphasize concepts like love, compassion, self-realization, and the pursuit of truth.\n\n3. **Scientific Perspective**: From a scientific standpoint, the meaning of life can be seen as the pursuit of knowledge, understanding, and the betterment of humanity. Some scientists, like Carl Sagan, view life's purpose as the exploration of the universe, the search for truth, and the advancement of human civilization.\n\n4. **Humanistic Perspective**: Humanistic psychology, pioneered by figures like Carl Rogers and Abraham Maslow, suggests that the meaning of life is found in the fulfillment of human potential, personal growth, and the pursuit of happiness and well-being.\n\nUltimately, the meaning of life is subjective and can vary greatly depending on one's beliefs, values, and experiences. It is a question that invites personal reflection and exploration, and the answer may evolve over time as one grows and learns."
--------------------------------------------------
[rank0]:[W1230 14:25:47.068075591 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Both runs should be using greedy decoding and hence give the same results, but the outputs clearly differ.
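Reusing the objects from the two scripts above, one way to make the decoding settings explicit on both sides (a sketch; the HF call in the original script relies on the checkpoint's default generation config):

```python
# HF transformers side: force greedy decoding regardless of the checkpoint's generation_config
output = model.generate(**tokenized, do_sample=False, num_beams=1, max_new_tokens=4096)[0]

# vLLM side: temperature=0.0 already selects greedy decoding
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=4096)
```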
The RoPE warnings at the beginning of both logs are also an issue.
vince62s changed discussion status to closed
this was for the reasoning model, closing