FastChat - error
Hi, since I updated FastChat to version 0.2.2 I can no longer get 4-bit GPTQ working, because I get this error:
python3 -m fastchat.serve.cli --model-path models/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128
usage: cli.py [-h] [--model-path MODEL_PATH] [--device {cpu,cuda,mps}] [--num-gpus NUM_GPUS] [--load-8bit]
[--conv-template CONV_TEMPLATE] [--temperature TEMPERATURE] [--max-new-tokens MAX_NEW_TOKENS] [--style {simple,rich}]
[--debug]
cli.py: error: unrecognized arguments: --wbits 4 --groupsize 128
How can I fix this? Thank you, bye!
I'm confused. When has FastChat ever supported GPTQ? I didn't know it did. And I can't see any recent commits that would affect this.
It's text-generation-webui that supports GPTQ with arguments --wbits 4 --groupsize 128.
I think you might be confusing the two pieces of software?
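For reference, with text-generation-webui those flags go on server.py rather than on a FastChat module, something like this (the model folder name here is just an example):

python server.py --model TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128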
You're right, but with FastChat 0.1 it was enough to create a directory called "repository", do a git clone of GPTQ-for-LLaMa inside it, then run the CUDA setup, and GPTQ would work with FastChat, which I find much better than text-generation-webui.
Below I'm pasting a guide taken from Medium that shows how to do what I described above.
The problem is that with version 0.2.2 I can't do it any more. The advantage is that FastChat + the 4-bit model = super speed! If you try this GitHub repo, https://github.com/thisserand/FastChat.git, you can still install FastChat 0.1.
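From memory, the steps with that fork looked roughly like this (the GPTQ-for-LLaMa repo URL, the directory name and the setup script are how I remember them from the guide, so check it before copying):

git clone https://github.com/thisserand/FastChat.git
cd FastChat
pip3 install -e .                  # install the 0.1-based fork
mkdir repository && cd repository  # the directory I mentioned above
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa
python3 setup_cuda.py install      # "launch the cuda setup": builds the CUDA kernels
cd ../..
python3 -m fastchat.serve.cli --model-path models/TheBloke_vicuna-7B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128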
Oh, interesting. I suppose they must have borrowed some code from text-generation-webui then. I'd no idea that was possible. Sounds like their recent rewrite must have removed the ability.
Why not just run text-generation-webui instead?
Because I've tried both systems and I believe FastChat is much faster and more accurate in its answers. I know it depends on the model, but I've done hundreds of tests and I'm sure FastChat is more coherent and accurate in its answers. Now I'm downloading your TheBloke/vicuna-13B-1.1-HF model, which I can run without 4-bit. I'll tell you in a few minutes...
OK, I just read the article you linked and now I understand. Martin Thissen made his own repo, merging GPTQ-for-LLaMa into FastChat. So you weren't using FastChat, you were using a fork of FastChat. If you want this to continue working, you'll need to wait for Martin to update his fork for FastChat 0.2.2, or update the code yourself.
It's good to hear FastChat is faster. But I can't see how it could possibly be more accurate, because both are literally using the same code for inference: Martin's fork just calls GPTQ-for-LLaMa to do the inference, which is exactly what text-generation-webui does.
What prompts were you using? One explanation for a perceived increase in accuracy could be that FastChat automatically applied a suitable prompt template, but that you haven't used the same template in text-generation-webui. When I query Vicuna, I use this prompt format:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
prompt goes here
### Response:
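So a complete request ends up looking like this (the instruction text here is just an example):

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Summarize the plot of Hamlet in two sentences.

### Response: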
text-generation-webui has a feature where you can define a template and then save it to be used for each request. It's in the bottom left of the inference UI.
I'll close this now, as it's not related to my files. I hope you manage to get it working OK for your needs.