In a Training Loop

Stefano Fiorucci (PRO)
anakin87
187 followers · 89 following
theanakin87 · anakin87 · stefano-fiorucci
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data... Contributing to the Haystack LLM framework
Recent Activity
replied to their post · about 9 hours ago

How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against the ground truth
- the reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

---

What happens at training time? We use Group Relative Policy Optimization with the tic-tac-toe env. No critic model is needed: the group is the baseline. Simpler than PPO.

1. Rollout generation: from the same board, the model plays N games via sampling
2. Each game is scored with deterministic rewards (win, format, ...)
3. The mean score is computed across the group
4. Each rollout's advantage = its score minus the group mean
5. The model is updated to favor trajectories above the baseline

Repeat.

For a deep dive, check out https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
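The group-relative advantage computation described in the post can be sketched in a few lines of Python. This is a minimal illustration, not the course's actual code: the reward function, its values, and the names are assumptions, standing in for whatever deterministic rewards (win, format, ...) the env defines.

```python
def score_rollout(outcome: str, well_formatted: bool) -> float:
    """Hypothetical deterministic reward for one tic-tac-toe rollout:
    +1.0 for a win, +0.2 for well-formatted moves (illustrative values)."""
    reward = 1.0 if outcome == "win" else 0.0
    if well_formatted:
        reward += 0.2
    return reward

def group_relative_advantages(scores: list[float]) -> list[float]:
    """GRPO baseline: each rollout's advantage is its score
    minus the mean score of the group (no critic model)."""
    baseline = sum(scores) / len(scores)
    return [s - baseline for s in scores]

# N rollouts sampled from the same starting board, then scored
scores = [score_rollout(o, f) for o, f in
          [("win", True), ("loss", True), ("draw", False), ("win", False)]]
advantages = group_relative_advantages(scores)
# Rollouts scoring above the group mean get a positive advantage
# and are reinforced; those below the mean are discouraged.
```

Since the baseline is the group mean, the advantages always sum to (approximately) zero: the update pushes probability mass toward the better-than-average trajectories in each group rather than toward an absolute target.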
reacted to their post with ❤️ · 1 day ago
posted an update · 1 day ago
Organizations

anakin87's datasets (12)
anakin87/tictactoe-demo · Viewer · Updated 5 days ago · 5 · 40
anakin87/tictactoe-filtered · Viewer · Updated 16 days ago · 174 · 19
anakin87/tictactoe · Viewer · Updated 16 days ago · 200 · 22
anakin87/Qwen3-0.6B-tuned-alphabet-sort-eval · Viewer · Updated Sep 4, 2025 · 15 · 6
anakin87/Qwen3-0.6B-alphabet-sort-eval · Viewer · Updated Sep 4, 2025 · 15 · 17
anakin87/events-scheduling · Viewer · Updated Apr 26, 2025 · 600 · 89 · 2
anakin87/evol-dpo-ita-reranked · Viewer · Updated Jan 14, 2025 · 19.8k · 11 · 5
anakin87/gemma-vs-gemma-preferences · Viewer · Updated Jan 14, 2025 · 24.7k · 5
anakin87/fine-instructions-ita-70k · Viewer · Updated Jan 14, 2025 · 69.9k · 26 · 4
anakin87/FineTome-single-turn-dedup · Viewer · Updated Jan 11, 2025 · 83.3k · 18
anakin87/tulu-3-sft-mixture-with-language · Viewer · Updated Dec 11, 2024 · 939k · 47
anakin87/medrag-pubmed-chunk · Viewer · Updated Feb 25, 2024 · 15.4k · 28