RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Paper: arXiv:2501.08617
🌐 Project Page | 📄 Paper | 🐙 GitHub
This checkpoint is a Llama-3-8B model fine-tuned with Reinforcement Learning from Human Feedback (RLHF) on realistic marketplace interactions. Please be aware that RLHF fine-tuning can inadvertently reinforce strategic deception or manipulative behaviors.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kaiquliang/Llama-3-8b-RLHF"

# Load the tokenizer and model; device_map="auto" spreads weights across
# available devices and torch_dtype="auto" keeps the checkpoint's native precision.
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
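A minimal generation sketch follows. The marketplace-style prompt and the sampling parameters are illustrative assumptions, not the prompts or settings used in the paper; see the GitHub repository for the actual prompts.

```python
# Hypothetical marketplace-style prompt, for illustration only.
prompt = "Customer: Does this camera come with a warranty?\nSeller:"

inputs = tok(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens.
print(tok.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```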
For additional resources, including prompts and code, please visit our GitHub repository.
If you find this model useful, please cite our paper:
@article{liang2025rlhs,
title={{RLHS}: Mitigating Misalignment in {RLHF} with Hindsight Simulation},
author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
journal={arXiv preprint arXiv:2501.08617},
year={2025}
}
Base model: meta-llama/Meta-Llama-3-8B