Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
CurtGPT

Using Microsoft's Phi 1.5 model like it was never intended.
This model is an adapter on Puffin-Phi-v2, trained with QLoRA and DPO on 60,000 samples from the Anthropic helpful-only dataset.
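For context, the sketch below shows what such a QLoRA + DPO adapter run can look like with the TRL and PEFT libraries. It is an illustration, not the training script used for this model: the base checkpoint id (teknium/Puffin-Phi-v2), the dataset id (Anthropic/hh-rlhf, helpful-base subset), the LoRA hyperparameters, and the target module names are all assumptions; for the quantization values actually used, see the card's own config below.

```python
# Illustrative QLoRA + DPO adapter training sketch (TRL + PEFT).
# All ids and hyperparameters here are assumptions, not the values used for CurtGPT.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

base_id = "teknium/Puffin-Phi-v2"  # assumed base checkpoint

# 4-bit quantization of the frozen base model (QLoRA); the card's own
# bitsandbytes config below lists the values actually used during training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)

# LoRA adapter trained on top of the frozen 4-bit base; module names are
# assumed for the Phi 1.5 architecture and may need adjusting per checkpoint.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["Wqkv", "out_proj"], task_type="CAUSAL_LM",
)

# Assumed source for the "Anthropic helpful-only" preference pairs; recent TRL
# versions extract the shared prompt from chosen/rejected pairs automatically.
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="curtgpt-dpo", beta=0.1, per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Because a PEFT adapter config is passed, DPOTrainer can use the base model with the adapter disabled as the implicit reference policy, so no separate reference model copy is needed.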
The following bitsandbytes quantization config was used during training: