Draft/Speculative decoding
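For context, draft/speculative decoding pairs a small "draft" model with the large target model: the draft cheaply proposes a few tokens, and the target verifies them, accepting the longest prefix that matches its own choices. A minimal toy sketch of the greedy variant (the `target` and `draft` functions here are illustrative stand-ins, not a real inference API):

```python
def speculative_decode(target, draft, prompt, k=4, max_new=12):
    """Greedy speculative decoding with a toy draft model.

    `target(seq)` and `draft(seq)` each return the next token id for a
    sequence. The output is identical to target-only greedy decoding;
    the speedup comes from the target verifying k draft tokens in what
    would be one batched forward pass in a real implementation.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft proposes k tokens autoregressively (the cheap model).
        ctx = seq[:]
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies: accept draft tokens while they match the
        #    target's own greedy choice; on the first mismatch, take the
        #    target's token instead and resample a new draft window.
        for t in proposal:
            want = target(seq)
            if t == want:
                seq.append(t)      # draft token accepted
            else:
                seq.append(want)   # mismatch: fall back to target's token
                break
    return seq[len(prompt):]
```

Because every appended token is the target's own greedy choice, the result is bit-identical to running the big model alone; a good draft model just lets more tokens through per verification step.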
Heh, I've been looking for one to try out, and I hadn't seen your stuff since the command-r v2 release a few months back. Time to give this a try :)
I'll share my impressions once I get a chance to try it.
Ahh, I see. command-a isn't yours; you just made the draft model. Sorry for the confusion. Oh well.
No worries :)
Well, I tried it anyway. But the extra 40B (of the base model) is just too much, so it runs at around 0.5 T/s. 100B+ models are just too large and slow to be useful for me, even when quantized down to Q2.
It might run considerably better on the 16 GB video card I have, but it's hard to say. We really need more VRAM and consumer-friendly options for these larger models over 70B.
Anyway, I'll still keep an eye out for any other -r writer models you might put out :)